Abstract

One-counter MDPs (OC-MDPs) and one-counter simple stochastic games (OC- 
SSGs) are 1-player, and 2-player turn-based zero-sum, stochastic games played 
on the transition graph of classic one-counter automata (equivalently, pushdown 
automata with a 1-letter stack alphabet). A key objective for the analysis and
verification of these games is the termination objective, where the players aim 
to maximize (minimize, respectively) the probability of hitting counter value 0, 
starting at a given control state and given counter value. 

Recently, we studied qualitative decision problems ("is the optimal termina- 
tion value equal to 1?") for OC-MDPs (and OC-SSGs) and showed them to be 
decidable in polynomial time (in NP ∩ coNP, respectively). However, quantitative
decision and approximation problems ("is the optimal termination value at
least p", or "approximate the termination value within ε") are far more challenging.
This is so in part because optimal strategies may not exist, and because even
when they do exist they can have a highly non-trivial structure. It thus remained 
open even whether any of these quantitative termination problems are computable. 

In this paper we show that all quantitative approximation problems for the
termination value for OC-MDPs and OC-SSGs are computable. Specifically, given an
OC-SSG, and given ε > 0, we can compute a value v that approximates the value
of the OC-SSG termination game within additive error ε, and furthermore we can
compute ε-optimal strategies for both players in the game.

A key ingredient in our proofs is a subtle martingale, derived from solving 
certain linear programs that we can associate with a maximizing OC-MDP. An 
application of Azuma's inequality on these martingales yields a computable bound 
for the "wealth" at which a "rich person's strategy" becomes ε-optimal for
OC-MDPs.



1. Introduction 

In recent years, there has been substantial research done to understand 
the computational complexity of analysis and verification problems for classes 
of finitely-presented but infinite-state stochastic models, MDPs, and stochastic
games, whose transition graphs arise from basic infinite-state automata-theoretic 
models, including: context-free processes, one-counter processes, and pushdown 
processes. It turns out these models are intimately related to important stochastic 
processes studied extensively in applied probability theory. In particular,
one-counter probabilistic automata are basically equivalent to (discrete-time)
quasi-birth-death processes (QBDs) (see [9]), which are heavily studied in queuing
theory and performance evaluation as a basic model of an unbounded queue with
multiple states (phases). It is very natural to extend these purely probabilistic 
models to MDPs and games, to model adversarial queuing scenarios. 

In this paper we continue this work by studying quantitative approximation 
problems for one-counter MDPs (OC-MDPs) and one-counter simple stochastic
games (OC-SSGs), which are 1-player, and turn-based zero-sum 2-player, stochastic
games on transition graphs of classic one-counter automata. In more detail, an
OC-SSG has a finite set of control states, which are partitioned into three types: 
a set of random states, from where the next transition is chosen according to a 
given probability distribution, and states belonging to one of two players: Max or 
Min, from where the respective player chooses the next transition. Transitions can 
change the state and can also change the value of the (unbounded) counter by at 
most 1. If there are no control states belonging to Max (Min, respectively), then 
we call the resulting 1-player OC-SSG a minimizing (maximizing, respectively) 
OC-MDP. Fixing strategies for the two players yields a countable state Markov 
chain and thus a probability space of infinite runs (trajectories). 







A central objective for the analysis and verification of OC-SSGs is the
termination objective: starting at a given control state and a given counter value j > 0,
player Max (Min) wishes to maximize (minimize) the probability of eventually
hitting the counter value 0 (in any control state). From well-known facts, it follows
that these games are determined, meaning they have a value, v, such that for every
ε > 0, player Max (Min) has a strategy that ensures the objective is satisfied with
probability at least v − ε (at most v + ε, respectively), regardless of what the other
player does. This value can be irrational even when the input data contains only 
rational probabilities, and this is so even in the purely stochastic case of QBDs 
without players (DSD). 

A special subclass of OC-MDPs, called solvency games, was studied in [1] as a 
simple model of risk-averse investment. Solvency games correspond to OC-MDPs 
where there is only one control state, but there are multiple actions that change 
the counter value ("wealth"), possibly by more than 1 per transition, according 
to a finite support probability distribution on the integers associated with each 
action. The goal is to minimize the probability of going bankrupt, starting with a 
given positive wealth. It is not hard to see that these are subsumed by minimizing 
OC-MDPs (see fl). It was shown in [1] that if the solvency game satisfies a
number of restrictive assumptions (in particular, on the eigenvalues of a matrix
associated with the game), then an optimal "rich person's" strategy (which does
the same action whenever the wealth is large enough) can be computed for it (in
exponential time). They showed such strategies are not optimal for unrestricted
solvency games and left the unrestricted case unresolved in [1].

We can classify analysis problems for OC-MDPs and OC-SSGs into two 
kinds. Quantitative analyses include: "is the game value at least/at most
p" for a given p ∈ [0, 1]; or "approximate the game value" to within a desired
additive error ε > 0. We can also restrict ourselves to qualitative analyses, which
ask "is the game value = 1? = 0?"² We are also interested in strategies (e.g.,
memoryless, etc.) that achieve these.

In recent work we have studied qualitative termination problems for
OC-SSGs. For both maximizing and minimizing OC-MDPs, we showed that these
problems are decidable in P-time, using linear programming, connections to the
theory of random walks on integers, and other MDP objectives. For OC-SSGs, we
showed the qualitative termination problem "is the termination value = 1?" is in
NP ∩ coNP. This problem is already as hard as Condon's quantitative termination
problem for finite-state SSGs. However we left open, as the main open question,
the computability of quantitative termination problems for OC-MDPs and
OC-SSGs.

2 The problem "is the termination value = 0?" is easier, and can be solved in polynomial time
without even looking at the probabilities labeling the transitions of the OC-SSG.

Our contribution. In this paper, we resolve positively the computability of all 
quantitative approximation problems associated with OC-MDPs and OC-SSGs. 
Note that, in some sense, approximation of the termination value in the setting 
of OC-MDPs and OC-SSGs cannot be avoided. This is so not only because the
value can be irrational, but because (see Example lA.ll in Section lATI) for
maximizing OC-MDPs there need not exist any optimal strategy for maximizing the
termination probability, only ε-optimal ones (whereas Min does have an optimal
strategy in OC-SSGs). Moreover, even for minimizing OC-MDPs, where optimal
strategies do exist, they can have a very complicated structure. In particular, as 
already mentioned for solvency games, there need not exist any "rich person's" 
strategy that can ignore the counter value when it is larger than some finite N > 0. 

Nevertheless, we show all these difficulties can be overcome when the goal
is to approximate the termination value of OC-SSGs and to compute ε-optimal
strategies. Our main result (Theorem 3.1) is the following:

There is an algorithm that, given as input: an OC-SSG, G, an initial control state
s, an initial counter value j > 0, and a (rational) approximation threshold ε > 0,

• computes a rational number, v′, such that |v′ − v*| < ε, where v* is the value
of the OC-SSG termination game on G, starting in configuration (s, j), and

• computes ε-optimal strategies for both players in the OC-SSG termination
game.

For OC-MDPs, i.e., 1-player OC-SSGs, the algorithm runs in exponential time in
the encoding size of the OC-MDP, and in polynomial time in log(1/ε) and log(j).
For 2-player OC-SSGs, the algorithm runs in nondeterministic exponential time
in the encoding size of the OC-SSG.³

3 We shall explain after the statement of Theorem 3.1, in footnote 4, precisely what we mean
by computing something in nondeterministic exponential time. It amounts to the standard notion
of nondeterministic computation used in the setting of total search problems.

We now outline our basic strategy for proving this theorem. Consider the case
of maximizing OC-MDPs, and suppose we would like to approximate the optimal
termination probability, starting at state q and counter value i. Intuitively, it is not
hard to believe that as the counter value goes to infinity, the optimal probability
of termination starting at a state q begins to approach the optimal probability, ν_q,
of forcing the counter to have a liminf value = −∞. We prove that this is indeed
the case. But we can compute the optimal value ν_q and an optimal strategy for
achieving it, based on results in our prior work [0 gj. For a given ε > 0, we
need to compute a bound N on the counter value, such that for any state q, and
all counter values N′ > N, the optimal termination probability starting at (q, N′)
is at most ε away from the optimal probability for the counter to have liminf
value = −∞. A priori it is not clear whether such a bound N is computable,
although it is clear that N exists. To show that it is computable, we employ a subtle
(sub)martingale, derived from solving a certain linear programming problem
associated with a given OC-MDP. By applying Azuma's inequality on this martingale,
we are able to show there are computable values c < 1, and h > 0, such that
for all i > h, starting from a state q and counter value i, the optimal probability of
both terminating and not encountering any state from which with probability 1 the
player can force the liminf counter value to go to −∞, is at most c^i/(1 − c). Thus,
the optimal termination probability approaches from above the optimal probability
of forcing the liminf counter value to be −∞, and the difference between these
two values is exponentially small in i, with a computable base c. This martingale
argument extends to OC-MDPs an argument recently used for analyzing
purely probabilistic one-counter automata (i.e., QBDs).

These bounds allow us to reduce the problem of approximating the termina- 
tion value to the reachability problem for an exponentially larger finite-state MDP, 
which we can solve (in exponential time) using linear programming. The case for 
general OC-SSGs and minimizing OC-MDPs turns out to follow a similar line of 
argument, reducing the essential problem to the case of maximizing OC-MDPs. 
In terms of complexity, the OC-SSG case requires "guessing" an appropriate (al- 
beit, exponential-sized) strategy, whereas the relevant exponential-sized strategy 
can be computed in deterministic exponential time for OC-MDPs. So our approx- 
imation algorithms run in exponential time for OC-MDPs and nondeterministic 
exponential time for OC-SSGs. 

Related work. As noted, one-counter automata with a non-negative counter 
are equivalent to pushdown automata restricted to a 1-letter stack alphabet (see
[§]), and thus OC-SSGs with the termination objective form a subclass of
pushdown stochastic games, or equivalently, recursive simple stochastic games
(RSSGs). These more general stochastic games were studied in [10], where it was
shown that many interesting computational problems, including any nontrivial
approximation of the termination value for general RSSGs and RMDPs, are
undecidable, as are qualitative termination problems. It was also shown in [10] that
for stochastic context-free games (1-exit RSSGs), which correspond to pushdown
stochastic games with only one state, both qualitative and quantitative termination
problems are decidable, and in fact qualitative termination problems are
decidable in NP ∩ coNP ( 111 111 ), while quantitative termination problems are decidable in
PSPACE. Solving termination objectives is a key ingredient for many more
general analyses and model checking problems for such stochastic games (see, e.g.,
0, @]). OC-SSGs are incomparable with stochastic context-free games. Specifically,
for OC-SSGs, the number of stack symbols is bounded by 1, instead of the
number of control states.

MDP variants of QBDs, essentially equivalent to OC-MDPs, have been
considered in the queueing theory and stochastic modeling literature, see [fill 171.
However, in order to keep their analyses tractable, these works perform a naive
finite-state "approximation" by cutting off the value of the counter at an arbitrary 
finite value N, and adding dead-end absorbing states for counter values higher 
than N. Doing this can radically alter the behavior of the model, even for purely 
probabilistic QBDs, and these authors establish no rigorous approximation bounds 
for their models. In a sense, our work can be seen as a much more careful and 
rigorous approach to finite approximation, employing at the boundary other
objectives like maximizing the probability that the liminf counter value = −∞. Unlike
the prior work, we establish rigorous bounds on how well our finite-state model
approximates the original infinite OC-MDP.

2. Preliminaries 

We assume familiarity with basic notions from probability theory. We call a
probability distribution $f$ over a discrete set $A$ positive if $f(a) > 0$ for all $a \in A$.

Definition 2.1. A One-Counter Simple Stochastic Game (OC-SSG) is given as
$\mathcal{A} = (Q, \Delta, P)$, where

• $Q$ is a finite non-empty set of control states, partitioned into the states $Q_\top$
of player Max, $Q_\bot$ of player Min, and stochastic states $Q_P$;

• $\Delta \subseteq Q \times \{-1, 0, +1\} \times Q$ is a set of transition rules, such that for all $q \in Q$
there is some $(q, a, r) \in \Delta$;

• $P$ is a map taking each tuple $(q, a, r) \in \Delta$ with $q \in Q_P$ to a positive rational
number $P((q, a, r))$, so that for every $q \in Q_P$: $\sum_{(q,a,r) \in \Delta} P((q, a, r)) = 1$.

A configuration is a pair $(q, c)$ of a control state, $q$, and an integer counter value
$c \in \mathbb{Z}$. The set of all configurations is $Q \times \mathbb{Z}$. An OC-SSG where $Q_\bot = \emptyset$ is called a
maximizing One-Counter Markov Decision Process (maximizing OC-MDP);
similarly, $Q_\top = \emptyset$ defines a minimizing OC-MDP. Finally, if $Q_\top = Q_\bot = \emptyset$ we have a
One-Counter Markov Chain (OC-MC).
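
As an informal illustration of Definition 2.1, the following Python sketch encodes an OC-SSG as plain data and checks the conditions of the definition (partition of the states, counter changes in {-1, 0, +1}, at least one outgoing rule per state, positive probabilities summing to 1). The class name `OCSSG`, its field names, and the small example game are hypothetical choices made for this sketch; they are not notation from the paper.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class OCSSG:
    """Hypothetical container mirroring Definition 2.1."""
    max_states: frozenset      # states of player Max
    min_states: frozenset      # states of player Min
    random_states: frozenset   # stochastic states
    rules: frozenset           # transition rules (q, a, r), a in {-1, 0, +1}
    prob: dict                 # rule leaving a stochastic state -> Fraction

    def states(self):
        return self.max_states | self.min_states | self.random_states

    def validate(self):
        Q = self.states()
        assert Q, "the set of control states must be non-empty"
        # The three classes must be pairwise disjoint (a partition of Q).
        assert len(Q) == (len(self.max_states) + len(self.min_states)
                          + len(self.random_states))
        for (q, a, r) in self.rules:
            assert q in Q and r in Q and a in (-1, 0, 1)
        for q in Q:
            outgoing = [t for t in self.rules if t[0] == q]
            assert outgoing, f"state {q!r} has no outgoing rule"
            if q in self.random_states:
                assert all(self.prob[t] > 0 for t in outgoing)
                assert sum(self.prob[t] for t in outgoing) == 1

    def kind(self):
        if not self.max_states and not self.min_states:
            return "OC-MC"
        if not self.min_states:
            return "maximizing OC-MDP"
        if not self.max_states:
            return "minimizing OC-MDP"
        return "OC-SSG"


# A made-up two-state example: Max owns "m", and "r" is stochastic.
example = OCSSG(
    max_states=frozenset({"m"}),
    min_states=frozenset(),
    random_states=frozenset({"r"}),
    rules=frozenset({("m", -1, "r"), ("m", 0, "m"),
                     ("r", +1, "m"), ("r", -1, "r")}),
    prob={("r", +1, "m"): Fraction(1, 3), ("r", -1, "r"): Fraction(2, 3)},
)
example.validate()
print(example.kind())
```

Exact rationals (`Fraction`) are used so that the "sums to 1" condition can be checked without rounding error.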

Let us fix an OC-SSG, $\mathcal{A} = (Q, \Delta, P)$. A run in $\mathcal{A}$ is an infinite sequence of
configurations $\omega = (q_0, c_0)(q_1, c_1)\cdots$ such that for all $i \geq 1$ we have that
$(q_{i-1}, c_i - c_{i-1}, q_i) \in \Delta$. We define for every $n \geq 0$ the following functions:

• $\mathit{State}^{(n)} : \mathit{Run} \to Q$ returns the $n$-th control state: $\mathit{State}^{(n)}(\omega) = q_n$.

• $C^{(n)} : \mathit{Run} \to \mathbb{Z}$ returns the $n$-th counter value: $C^{(n)}(\omega) = c_n$.

A finite prefix, $w = (q_0, c_0)\cdots(q_k, c_k)$, of a run is called a finite path, and
$\mathit{len}(w) := k$ is its length. We denote by $\mathit{Run}$ the set of all runs, and by $\mathit{Run}(w)$
the set of all runs starting with a finite path $w$. Closing the set $\{\mathit{Run}(w) \mid
w \text{ is a finite path}\}$ under complements and countable unions generates the
standard Borel $\sigma$-algebra of measurable sets of runs. Note that the functions $\mathit{State}^{(n)}$
and $C^{(n)}$ have measurable pre-images.

A strategy for player Max is a function, $\sigma$, which to each finite path $w =
(q_0, c_0)\cdots(q_k, c_k)$, also called a history in this context, where $q_k \in Q_\top$, assigns a
probability distribution on the set of rules of the form $(q_k, a, r) \in \Delta$. It is called
pure if $\sigma(w)$ assigns probability 1 to some transition, for each history $w$. We call
$\sigma$ counterless if $\sigma(w)$ depends only on the last control state, $q_k$. Strategies for Min
are defined similarly, just by substituting $Q_\top$ with $Q_\bot$.

Assume that a pair $(\sigma, \pi)$ of strategies for Max and Min, respectively, is fixed.
Consider a finite path $w = (q_0, c_0)\cdots(q_k, c_k)$ and a rule $(q_{i-1}, c_i - c_{i-1}, q_i) \in \Delta$,
$1 \leq i \leq k$. We assign to this rule a weight, $x_i$, as follows: If $q_{i-1} \in Q_P$ then
$x_i = P((q_{i-1}, c_i - c_{i-1}, q_i))$. If $q_{i-1} \in Q_\top$ then $x_i$ is equal to the probability of
$(q_{i-1}, c_i - c_{i-1}, q_i)$ assigned by $\sigma((q_0, c_0)\cdots(q_{i-1}, c_{i-1}))$, and similarly for $q_{i-1} \in Q_\bot$
and $\pi$. The weight of $w$ is then $x_w = \prod_{i=1}^{\mathit{len}(w)} x_i$, where the empty product is equal
to 1. Once we also fix an initial configuration, $(q, c)$, we obtain a probability
measure $\mathbb{P}^{\sigma,\pi}_{(q,c)}$. This is defined by setting $\mathbb{P}^{\sigma,\pi}_{(q,c)}(\mathit{Run}(w)) = 0$ if $w$ does not start with
$(q, c)$, and $\mathbb{P}^{\sigma,\pi}_{(q,c)}(\mathit{Run}(w)) = x_w$ if $w$ starts with $(q, c)$. This and the requirement of
countable additivity of a measure already uniquely describes $\mathbb{P}^{\sigma,\pi}_{(q,c)}$ (see, e.g., [16,
p. 30] for the case of MDPs; the extension of this to SSGs is straightforward).
If $\mathcal{A}$ is a maximizing OC-MDP, a minimizing OC-MDP, or an OC-MC, we denote
the probability measure by $\mathbb{P}^{\sigma}_{(q,c)}$, $\mathbb{P}^{\pi}_{(q,c)}$, or $\mathbb{P}_{(q,c)}$, respectively.
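
The weight of a finite path, as defined above, is just a product of per-step probabilities. The following Python sketch computes it under some assumptions of ours: the function name `path_weight`, the `owner` dictionary (mapping each state to "max", "min" or "rand"), and the representation of strategies as functions from a history to a dictionary of rule probabilities are all hypothetical. Steps that use a rule with no assigned probability get weight 0, which is a convenience of this sketch rather than part of the definition.

```python
from fractions import Fraction


def path_weight(path, owner, P, sigma, pi):
    """Weight x_w of a finite path w = [(q_0, c_0), ..., (q_k, c_k)]."""
    weight = Fraction(1)  # the empty product is 1
    for i in range(1, len(path)):
        (q_prev, c_prev), (q_i, c_i) = path[i - 1], path[i]
        rule = (q_prev, c_i - c_prev, q_i)
        history = tuple(path[:i])  # (q_0, c_0) ... (q_{i-1}, c_{i-1})
        if owner[q_prev] == "rand":
            x_i = P.get(rule, Fraction(0))
        elif owner[q_prev] == "max":
            x_i = sigma(history).get(rule, Fraction(0))
        else:
            x_i = pi(history).get(rule, Fraction(0))
        weight *= x_i
    return weight


# Made-up example: Max owns "m", and "r" is stochastic.
owner = {"m": "max", "r": "rand"}
P = {("r", +1, "m"): Fraction(1, 3), ("r", -1, "r"): Fraction(2, 3)}


def sigma(history):
    # A pure counterless strategy: in "m", always decrement and move to "r".
    return {("m", -1, "r"): Fraction(1)}


def pi(history):
    # There are no Min states in this example.
    return {}


w = [("m", 2), ("r", 1), ("r", 0)]
print(path_weight(w, owner, P, sigma, pi))
```

Here the first step is chosen by Max with probability 1 and the second is a random step of probability 2/3, so the weight of `w` is 2/3.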

Objectives. In this paper, an objective for an OC-SSG is a measurable set of runs.
Player Max tries to maximize the probability of this set, whereas player Min tries
to minimize it. Given an objective, $O$, for an OC-SSG, $\mathcal{A}$, and a configuration,
$s = (q, c)$, we define the value in $s$ as

\[ \mathrm{Val}_{\mathcal{A}}(O, s) := \sup_{\sigma} \inf_{\pi} \mathbb{P}^{\sigma,\pi}_{s}(O) = \inf_{\pi} \sup_{\sigma} \mathbb{P}^{\sigma,\pi}_{s}(O). \]

The latter equality follows from Martin's Blackwell determinacy theorem [15].
We write just $\mathrm{Val}(O, s)$ if $\mathcal{A}$ is understood. For an $\varepsilon \geq 0$, a strategy $\sigma$ for Max is
called $\varepsilon$-optimal in $s$ if $\mathbb{P}^{\sigma,\pi}_{s}(O) \geq \mathrm{Val}(O, s) - \varepsilon$ for every $\pi$. Similarly, a strategy
$\pi$ for Min is $\varepsilon$-optimal in $s$ if $\mathbb{P}^{\sigma,\pi}_{s}(O) \leq \mathrm{Val}(O, s) + \varepsilon$ for every $\sigma$. A 0-optimal
strategy is called optimal. Note that by determinacy both players have $\varepsilon$-optimal
strategies for every $\varepsilon > 0$.

The key objective is the termination objective: 

\[ \mathit{Term} := \{\omega \in \mathit{Run} \mid \exists n : C^{(n)}(\omega) \leq 0\}. \]

The name "termination" stems from the connection to one-counter automata. 
Such automata also have a finite number of control states and a non-negative 
counter, and a run can be considered to "terminate" upon hitting counter value 
0. OC-SSGs do not necessarily halt when the counter is 0, and allow negative 
counter values. However, this difference is irrelevant from the perspective of the 
termination objective, for which only the part of runs with non-negative counter 
values matter. 

Games without a Counter. In our arguments we also use the notion of Simple 
Stochastic Games (SSGs) of Condon [7], which are similar to OC-SSGs. The
main difference is the lack of a counter, and the focus on the objective of reaching 
a distinguished sink state. 

Definition 2.2. A simple stochastic game (SSG) is a tuple $\mathcal{G} = (S, \mapsto, \mathit{Prob})$,
where

• $S$ is a finite set of states, partitioned into the states $S_\top$ of player Max, $S_\bot$ of
player Min, and stochastic states $S_P$;

• $\mapsto \subseteq S \times S$ is a transition relation such that for every state $s \in S$ there is at
least one state $r \in S$ such that $s \mapsto r$;

• $\mathit{Prob}$ is a probability assignment which to each $s \in S_P$ assigns a rational
probability distribution on its set of successors, where for a state $s \in S_P$ its
successors are defined to be the set $\{r \mid s \mapsto r\}$.

If $S_\bot = \emptyset$ we call $\mathcal{G}$ a maximizing Markov decision process (maximizing MDP). If
$S_\top = \emptyset$ we call it a minimizing MDP. If $S_\top = S_\bot = \emptyset$ we call $\mathcal{G}$ a Markov chain.

The SSG also comes with a distinguished sink state $s_0 \in S$, and this implicitly
defines the reachability objective "reach $s_0$" defined by runs $\omega$ which visit $s_0$.

Runs, strategies, probability measures and values with respect to objectives 
are defined analogously to those for OC-SSGs, just by removing references to the 
counter. In particular, runs are sequences of states. The following is well known. 

Fact 2.3. (See, e.g., /0 S [iol/ . ) For both maximizing and minimizing MDPs, 
optimal pure memoryless strategies for reachability exist and can be computed, 
together with the optimal reachability value, in polynomial time. 
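
For orientation, one standard way to obtain the polynomial-time bound of Fact 2.3 for a maximizing MDP is a linear program with one variable $x_s$ per state. The formulation below is a textbook-style sketch stated in the notation of Definition 2.2; it is not claimed to be the exact program used in the cited references. Its unique optimal solution assigns to each $x_s$ the optimal probability of reaching $s_0$ from $s$, because that value vector is the least non-negative vector satisfying the constraints.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{s \in S} x_s \\
\text{subject to}\quad
 & x_{s_0} = 1, \\
 & x_s \ge x_r
   && \text{for all } s \in S_\top \setminus \{s_0\} \text{ and all } r \text{ with } s \mapsto r, \\
 & x_s \ge \sum_{r \,:\, s \mapsto r} \mathit{Prob}(s)(r)\, x_r
   && \text{for all } s \in S_P \setminus \{s_0\}, \\
 & x_s \ge 0
   && \text{for all } s \in S.
\end{align*}
```

An optimal pure memoryless strategy can then be read off by choosing, in each state of $S_\top$, a successor leading toward $s_0$ among those attaining the maximum value.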

3. Main Result 

Theorem 3.1 (Main). There is an algorithm that, given an OC-SSG, $\mathcal{A}$, a
configuration, $(q, i)$, $i > 0$, and a rational $\varepsilon > 0$, computes a rational number, $v$, such that
$|\mathrm{Val}(\mathit{Term}, (q, i)) - v| < \varepsilon$, and computes strategies $\sigma$ and $\pi$ for the Max and Min
player, respectively, such that both $\sigma$ and $\pi$ are $\varepsilon$-optimal starting in $(q, i)$ with
respect to the termination objective. The algorithm runs in nondeterministic time
exponential in $\|\mathcal{A}\|$ and polynomial in $\log(i)$ and $\log(1/\varepsilon)$. If $\mathcal{A}$ is an OC-MDP,
then the algorithm runs in deterministic time exponential in $\|\mathcal{A}\|$ and polynomial
in $\log(i)$ and $\log(1/\varepsilon)$.⁴


4 To make precise the meaning of this theorem, we have to spell out precisely what we mean
by a nondeterministic algorithm that computes $v$, $\sigma$ and $\pi$ within given resource bounds. This is
a standard notion for total search problems. We will say a nondeterministic algorithm (i.e.,
nondeterministic Turing machine) computes $v$, $\sigma$ and $\pi$ within the specified resource bounds (namely,
exponential time) if given any input the algorithm halts in exponential time on all computation
paths, and furthermore, if the input is not well-formed the algorithm "rejects" it on all computation
paths, but if the input is well-formed (i.e., if it is given a well-formed OC-SSG, initial
configuration $(q, i)$, and $\varepsilon > 0$) the nondeterministic algorithm: (a) has at least one accepting computation
path; and (b) on every accepting computation path it outputs values $v$, $\sigma$, and $\pi$ which satisfy
that $|\mathrm{Val}(\mathit{Term}, (q, i)) - v| < \varepsilon$ and such that $\sigma$ and $\pi$ are $\varepsilon$-optimal strategies for Max and Min,
respectively, for the given input OC-SSG $\mathcal{A}$ and initial configuration $(q, i)$. On rejecting
computation paths the algorithm need not output anything. Note that the outputs on different accepting
executions may be different, but they must all satisfy the required specification.

Let us first briefly sketch the main ideas in the proof of Theorem 3.1.
First, observe that for all $q \in Q$ and $i \leq j$ we have that $\mathrm{Val}(\mathit{Term}, (q, i)) \geq
\mathrm{Val}(\mathit{Term}, (q, j)) \geq 0$. Let

\[ \mu_q := \lim_{i \to \infty} \mathrm{Val}(\mathit{Term}, (q, i)). \]

Since $\mu_q \leq \mathrm{Val}(\mathit{Term}, (q, i))$ for an arbitrarily large $i$, Player Max should be able to
decrease the counter by an arbitrary value with probability at least $\mu_q$, no matter
what Player Min does. The objective of "decreasing the counter by an arbitrary
value" can be formalized directly as the following "limit" objective, which has
useful connections to termination ftj^:

\[ \mathit{LimInf}(= -\infty) := \{\omega \in \mathit{Run} \mid \liminf_{n \to \infty} C^{(n)}(\omega) = -\infty\}. \]

OC-SSGs with this objective are determined, which means that the following value
is defined for every $q \in Q$:

\[ \nu_q := \mathrm{Val}(\mathit{LimInf}(= -\infty), (q, n)), \quad \text{where } n = 0. \tag{1} \]

Remark 3.1. Observe that due to the nature of $\mathit{LimInf}(= -\infty)$ we would obtain the
same value $\nu_q$ if we used any other value of $n$. It will often be the case that we
measure the (optimal) probability of some events, where the resulting number does
not depend on the initial counter value. From now on, in such cases we will
specify only the initial state, so, e.g., (1) would become $\nu_q := \mathrm{Val}(\mathit{LimInf}(= -\infty), q)$.

One intuitively expects that $\mu_q = \nu_q$, and we show that this is indeed the case
(see Corollary 3.13). Further, by [2, Theorem 2], $\nu_q$ is rational and computable in
non-deterministic time polynomial in $\|\mathcal{A}\|$. Moreover, both players have optimal
pure counterless strategies $(\sigma^*, \pi^*)$ computable in non-deterministic polynomial
time. For OC-MDPs, both the value $\nu_q$ and the optimal strategies can be computed
in deterministic time polynomial in $\|\mathcal{A}\|$.

Obviously, there must be a (sufficiently large) $N$ such that $\mathrm{Val}(\mathit{Term}, (q, i)) -
\mu_q < \varepsilon$ for all $q \in Q$ and $i \geq N$. We show that an upper bound on $N$ is computable,
and is at most exponential in $\|\mathcal{A}\|$ and polynomial in $\log(1/\varepsilon)$, in Section 3.1. As
we shall see, this part is highly non-trivial. For all configurations $(q, i)$, where
$i \geq N$, the value $\mathrm{Val}(\mathit{Term}, (q, i))$ can be approximated by $\mu_q$ ($= \nu_q$), and both
players can use the optimal strategies $(\sigma^*, \pi^*)$ for the $\mathit{LimInf}(= -\infty)$ objective. For
the remaining configurations $(q, i)$, where $i < N$, we consider a (finite-state) SSG
$\mathcal{G}$ obtained by restricting ourselves to configurations with counter between 0 and
$N$, extended by two fresh stochastic states $s_0, s_1$ with self-loops. All configurations
of the form $(q, 0)$ have only one outgoing edge leading to $s_0$, and all configurations
of the form $(q, N)$ can enter either $s_0$ with probability $\nu_q$, or $s_1$ with probability
$1 - \nu_q$. In this SSG, we compute the values and optimal strategies for the objective
of reaching $s_0$. This can be done in nondeterministic time polynomial in the size
of $\mathcal{G}$ (i.e., exponential in $\|\mathcal{A}\|$). If $\mathcal{A}$ is an OC-MDP, then $\mathcal{G}$ is an MDP, and the
values and optimal strategies can be computed in deterministic polynomial time
in the size of $\mathcal{G}$ (i.e., exponential in $\|\mathcal{A}\|$) by linear programming (this applies
both to the "maximizing" and the "minimizing" OC-MDPs). Thus, we obtain
the required approximations of $\mathrm{Val}(\mathit{Term}, (q, i))$ for $i < N$, and the associated
$\varepsilon$-optimal strategies.
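
The finite-state construction just described can be sketched in Python for a maximizing OC-MDP. This sketch makes several assumptions of ours: the boundary values $\nu_q$ and the cut-off $N$ are taken as given inputs, the function name `truncated_values` and the data layout are hypothetical, and plain value iteration is used in place of the linear programming mentioned in the text, so the result is only a numerical approximation of the values of the truncated model.

```python
def truncated_values(owner, rules, P, nu, N, tol=1e-12, max_sweeps=100_000):
    """Approximate the values of reaching s_0 in the truncated model.

    owner[q] is "max" or "rand"; rules is a set of (q, a, r) with a in
    {-1, 0, +1}; P gives probabilities of rules leaving random states;
    nu[q] is the boundary value used at counter N.
    """
    assert N >= 1
    states = sorted(owner)
    out = {q: [t for t in rules if t[0] == q] for q in states}
    # Start from 0 everywhere; iterating from below converges to the
    # least fixed point, which is the optimal reachability value.
    val = {(q, i): 0.0 for q in states for i in range(N + 1)}
    for q in states:
        val[(q, N)] = float(nu[q])  # (q, N) enters s_0 with probability nu_q
        val[(q, 0)] = 1.0           # (q, 0) moves to s_0 with probability 1
    for _ in range(max_sweeps):
        delta = 0.0
        for q in states:
            for i in range(1, N):
                if owner[q] == "rand":
                    new = sum(P[(q, a, r)] * val[(r, i + a)]
                              for (_, a, r) in out[q])
                else:  # a state of player Max
                    new = max(val[(r, i + a)] for (_, a, r) in out[q])
                delta = max(delta, abs(new - val[(q, i)]))
                val[(q, i)] = new
        if delta < tol:
            break
    return val


# Made-up example: a single random state whose counter goes up with
# probability 2/3 and down with probability 1/3, so nu_r = 0.
owner = {"r": "rand"}
rules = {("r", +1, "r"), ("r", -1, "r")}
P = {("r", +1, "r"): 2 / 3, ("r", -1, "r"): 1 / 3}
values = truncated_values(owner, rules, P, nu={"r": 0.0}, N=20)
print(values[("r", 1)], values[("r", 3)])
```

In this example the termination probability of the untruncated chain from counter value i is (1/2)^i, and the truncated values at N = 20 differ from it by less than 10^-5, which illustrates how the cut-off error shrinks as N grows.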

Proof of Theorem 3.1. The algorithm is given an OC-SSG $\mathcal{A} = (Q, \Delta, P)$, an
initial configuration $(q, i)$, and a rational number $\varepsilon > 0$, as input. Recall that for
$r \in Q$ we set $\nu_r := \mathrm{Val}(\mathit{LimInf}(= -\infty), r)$. The algorithm does the following:

1. Compute a pair $(\sigma^*, \pi^*)$ of pure counterless strategies, for players Max and
Min, respectively, which are optimal for $\mathit{LimInf}(= -\infty)$ starting at every
state $r \in Q$. Compute $\nu_r$, for every $r \in Q$.

2. Compute $N$ such that $\mathrm{Val}(\mathit{Term}, (r, j)) - \nu_r < \varepsilon$ for all $r \in Q$ and $j \geq N$.

3. If $i \geq N$ then return $\nu_q$, $\sigma^*$, $\pi^*$.

4. Otherwise apply the algorithm from Lemma 3.14 to $\mathcal{A}$, $\varepsilon$, $(\nu_r)_{r \in Q}$, $N$, $\sigma^*$, $\pi^*$
and return $v_{(q,i)}$, $\sigma$, $\pi$ from its output.

A key step is obviously step 4, which is not described here. We shall describe
and prove the correctness and complexity of that step in Lemma 3.14. If we can
carry out the computations as specified in Steps 1 and 2, then the correctness of
the output in Step 3 holds by definition. Let us now evaluate the complexity of
the first two steps (using some of our earlier results, and some results that will be
established in Section 3.1):

• (Step 1) The values $\nu_r$, $r \in Q$, which have polynomially big encoding
by [2, Proposition 9], can be guessed and verified in polynomial time by [2,
Theorem 2]. Strategies exist that are optimal with respect to $\mathit{LimInf}(= -\infty)$
and are pure and counterless by [2, Proposition 7]. We can guess such
a strategy $\sigma^*$ and verify, using the numbers $\nu_r$, that it is $\mathit{LimInf}(= -\infty)$-optimal
for Max; similarly for $\pi^*$ and Min. If $\mathcal{A}$ is an OC-MDP, all the
above can be computed deterministically in time polynomial in $\|\mathcal{A}\|$.

• (Step 2) Fixing $\pi^*$ in $\mathcal{A}$ we obtain a maximizing OC-MDP $\mathcal{A}'$. Lemma 3.9
applied to $\mathcal{A}'$ allows us to compute deterministically a bound $N \in
\exp(\|\mathcal{A}'\|^{O(1)}) \cdot O(\log(1/\varepsilon))$ such that in $\mathcal{A}'$, $\mathrm{Val}(\mathit{Term}, (r, j)) - \nu_r < \varepsilon$ for
all $r \in Q$ and $j \geq N$. By Lemma 3.12 this $N$ satisfies the requirements of
step 2.

□ 

3.1. Bounding counter value N for maximizing OC-MDPs

Consider a maximizing OC-MDP $\mathcal{A} = (Q, \Delta, P)$. Recall the definition of $\nu_q$
from (1) and the notational convention introduced in Remark 3.1. Specifically,
we have $\nu_q = \sup_{\sigma} \mathbb{P}^{\sigma}_{q}(\mathit{LimInf}(= -\infty))$ for all $q \in Q$. Given $\varepsilon > 0$, we show
here how to obtain a computable (exponential) bound on a number $N$ such that
$|\mathrm{Val}(\mathit{Term}, (q, i)) - \nu_q| < \varepsilon$ for all $i \geq N$. We denote by $T$ the set of all states $q$
with $\nu_q = 1$, and we define the objective of reaching $T$ as follows:

\[ \mathit{Reach}_T := \{\omega \in \mathit{Run} \mid \mathit{State}^{(i)}(\omega) \in T \text{ for some } i \geq 0\}. \]

Further, we define the objective $\neg\mathit{Reach}_T := \mathit{Run} \setminus \mathit{Reach}_T$.

Fact 3.2 (cf. yi Proposition 3.2]). The number $\nu_q$ is the maximal probability of
reaching $T$ from $q$ (see Remark 3.1), i.e.,

\[ \nu_q = \mathrm{Val}(\mathit{Reach}_T, q) = \sup_{\sigma} \mathbb{P}^{\sigma}_{q}(\mathit{Reach}_T). \]

Lemma 3.3. For all $q \in Q$ and $i \geq 0$:

\[ \nu_q \leq \mathrm{Val}(\mathit{Term}, (q, i)) \leq \sup_{\sigma} \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T) + \nu_q. \tag{2} \]

Proof. The first inequality is obvious. Because $\mathit{LimInf}(= -\infty) \cap \mathit{Run}((q, i)) \subseteq
\mathit{Term} \cap \mathit{Run}((q, i))$, we have $\mathrm{Val}(\mathit{Term}, (q, i)) = 1$ for all $q \in T$, $i \geq 0$, from
which the second inequality follows by an easy application of the union bound.
Namely, under any strategy $\sigma$, the event of termination can be split into the event of
terminating and not reaching $T$, together with the event of terminating and reaching
$T$. The probability of the latter event is clearly upper bounded by $\nu_q$. □
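
Spelled out, the union-bound step in the proof above can be written as the following chain, for an arbitrary strategy $\sigma$ and configuration $(q, i)$. This is only a restatement of the argument, using Fact 3.2 for the last inequality; the primed strategy $\sigma'$ is a bound variable introduced here to range over all strategies of Max.

```latex
\begin{align*}
\mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term})
  &= \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)
   + \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term} \cap \mathit{Reach}_T) \\
  &\le \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)
   + \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Reach}_T) \\
  &\le \sup_{\sigma'} \mathbb{P}^{\sigma'}_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)
   + \nu_q .
\end{align*}
```

Taking the supremum over $\sigma$ on the left-hand side gives the second inequality of (2).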

To provide the promised bound on $N$ we will prove an upper bound on
$\sup_{\sigma} \mathbb{P}^{\sigma}_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)$ which decreases toward 0 exponentially fast in $i$. We
will first define a suitably restricted class of OC-MDPs (we call them "rising"
OC-MDPs) and find such a bound using martingale theory (Lemma 3.8) for that
restricted class. We then extend the results to all OC-MDPs (Lemma 3.9) by
showing that for every OC-MDP, $\mathcal{A}$, there is a polynomially bigger "rising" OC-MDP,
$\mathcal{R}$, which "embeds" in it the states of the original OC-MDP, and preserves the
rate with which $\sup_{\sigma} \mathbb{P}^{\sigma}(\mathit{Term} \cap \neg\mathit{Reach}_T)$ reaches 0 from those corresponding
states. We will make this precise later.

To be able to use the martingale theory methods for rising OC-MDPs we need
to guarantee that in each rising OC-MDP, under every pure counterless
strategy, $\liminf_{i \to \infty} C^{(i)}/i$ is almost surely positive. This value is sometimes called
the mean-payoff, see also l^. We now state the definition of a rising OC-MDP
using two simple properties, and show that these two properties guarantee that the
mean-payoff is almost surely positive.

Definition 3.4. A pure counterless strategy, $\sigma$, is idling, if there is a state
$q \in Q$, such that $\mathbb{P}^{\sigma}_{(q,0)}(\exists i > 0 : \mathit{State}^{(i)} = q) = 1$ and for all $i > 0$:

\[ \mathbb{P}^{\sigma}_{(q,0)}(\mathit{State}^{(i)} = q \Rightarrow C^{(i)} = 0) = 1. \]

A maximizing OC-MDP is called rising if $T = \{q \in Q \mid \nu_q = 1\} = \emptyset$ and no
pure counterless strategy is idling.

Lemma 3.5. Let $\mathcal{A} = (Q, \Delta, P)$ be a rising OC-MDP. Then for every pure
counterless strategy, $\sigma$, and every $q \in Q$ we have $\mathbb{P}^{\sigma}_{q}(\liminf_{i \to \infty} C^{(i)}/i > 0) = 1$.

Proof. Let us fix a pure counterless strategy $\sigma$. Because $\sigma$ is counterless, there
is a collection of disjoint subsets of $Q$, called ergodic sets, or bottom strongly
connected components (BSCCs), in the standard theory of Markov chains, such
that almost every run ends up visiting infinitely often exactly the states of one of
the BSCCs. Let us focus, for a while, on a single BSCC $C \subseteq Q$. By standard
results, for each pair of states $r, s \in C$ the play from $r$ almost surely visits $s$,
and the expected time to visit $s$ from $r$ is finite. As a consequence, there is
a unique constant $p$ such that $\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)}/i > 0) = p$ for all $r \in C$,
because $\liminf_{i\to\infty} C^{(i)}/i > 0$ is a prefix-independent property. Moreover, we can
use the result from 1E2I Theorem 3.2] which says that if there is some $r \in C$
such that $\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)}/i > 0) > 0$, then there must also be some $s \in C$ such that
$\mathbb{P}^\sigma_s(\liminf_{i\to\infty} C^{(i)}/i > 0) = 1$. Thus either $p = 0$ or $p = 1$. Let us first prove by
contradiction that $p = 1$, and then we shall consider the more general case where,
rather than assuming $r \in C$ for a BSCC $C$, we consider an arbitrary start state in
the entire OC-MDP.
Assume that $p = 0$. By Fact 3.2, $T = \emptyset$ implies that
$\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)} = -\infty) = 0$ for all $r \in C$. It is easy to see that this
implies that $\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)}/i \geq 0) = 1$ for all $r \in C$ and all strategies $\sigma$.

Due to our assumption of $p = 0$, $\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)}/i = 0) = 1$ for all $r \in C$.
Now Lemma 3.3 from [3] says: "For all $q$, the pure counterless strategies
$\tau$ which satisfy $\mathbb{P}^\tau_q(\liminf_{i\to\infty} C^{(i)} = -\infty) = 1$ are exactly those which satisfy
$\mathbb{P}^\tau_q(\liminf_{i\to\infty} C^{(i)}/i \leq 0) = 1$ and $\mathbb{P}^\tau_{(q,0)}(\exists i : C^{(i)} < 0) > 0$." But we do
not have any strategies of the first kind, so there is a state $r \in C$ such that
$\mathbb{P}^\sigma_{(r,0)}(\forall i : C^{(i)} \geq 0) = 1$. If $\mathbb{P}^\sigma_{(r,0)}(\exists i > 0 : \mathit{State}^{(i)} = r \wedge C^{(i)} > 0) > 0$ then, because
the expected time between two visits to $r$ is finite, it can fairly easily be
established that $\mathbb{P}^\sigma_r(\liminf_{i\to\infty} C^{(i)}/i > 0) > 0$, which would contradict the assumption
$p = 0$. Thus $\mathbb{P}^\sigma_{(r,0)}(\forall i > 0 : \mathit{State}^{(i)} = r \Rightarrow C^{(i)} = 0) = 1$ and $\sigma$ is thus idling. But
this is not possible because by assumption $\mathcal{A}$ is rising, so the assumption $p = 0$
cannot be satisfied, and we have proved that $p = 1$.

Now consider an arbitrary state $q \in Q$. Because $\liminf_{i\to\infty} C^{(i)}/i > 0$ is
prefix-independent, and almost every run from $q$ reaches some BSCC, where
$\liminf_{i\to\infty} C^{(i)}/i > 0$ is satisfied almost surely, we have
$\mathbb{P}^\sigma_q(\liminf_{i\to\infty} C^{(i)}/i > 0) = 1$. □

Now we define a suitable submartingale for a given rising OC-MDP, and use
Azuma's inequality to show that $\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)$ decreases to $0$
exponentially fast in $i$. Recall that a stochastic process $m^{(0)}, m^{(1)}, \ldots$ is a submartingale
if, for all $i \geq 0$, $\mathbb{E}[|m^{(i)}|] < \infty$, and $\mathbb{E}[m^{(i+1)} \mid m^{(1)}, \ldots, m^{(i)}] \geq m^{(i)}$ almost surely. If
we further assume that $|m^{(i+1)} - m^{(i)}| \leq c$ almost surely for all $i \geq 0$, we can apply
the Azuma-Hoeffding inequality (see footnote 5), which says that the following holds for all $t > 0$
and $n > 0$:

$$\mathbb{P}(m^{(n)} - m^{(0)} \leq -t) \leq \exp\left(\frac{-t^2}{2nc^2}\right) \qquad (3)$$
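As a sanity check of inequality (3), the following minimal sketch compares the exact lower tail of a simple random walk with the Azuma-Hoeffding bound. The walk (i.i.d. steps $+1$ with probability $p \geq 1/2$ and $-1$ otherwise, so it is a submartingale with $c = 1$) and all function names are assumptions chosen for illustration; they do not come from the paper.

```python
from math import comb, exp

def lower_tail(n: int, t: float, p: float) -> float:
    """Exact P(m_n - m_0 <= -t) for a walk with i.i.d. steps +1 (prob. p) and -1."""
    # After n steps with k up-moves the displacement is 2k - n.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if 2 * k - n <= -t)

def azuma_bound(n: int, t: float, c: float = 1.0) -> float:
    """Right-hand side of inequality (3): exp(-t^2 / (2 n c^2))."""
    return exp(-t * t / (2 * n * c * c))

def check(p: float = 0.6, max_n: int = 40) -> bool:
    """True if the exact tail never exceeds the bound for all 1 <= t <= n <= max_n."""
    return all(lower_tail(n, t, p) <= azuma_bound(n, t) + 1e-12
               for n in range(1, max_n + 1) for t in range(1, n + 1))
```

For $p \geq 1/2$ the call `check(p)` is expected to return `True`; the bound is typically far from tight, which is harmless for the exponential decay argument that follows.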

5 In the literature (see, e.g., [13]), the Azuma-Hoeffding inequality is usually stated for
martingales and supermartingales, where it takes the form
$\mathbb{P}(m^{(n)} - m^{(0)} \geq t) \leq \exp(-t^2/(2nc^2))$. Inequality (3) is obtained just by realizing that if
$m^{(0)}, m^{(1)}, \ldots$ is a submartingale, then $-m^{(0)}, -m^{(1)}, \ldots$ is a supermartingale.

Let $\mathcal{A} = (Q, \delta, P)$ be a rising OC-MDP. Since $\mathcal{A}$ is rising, the mean payoff
(i.e., the average change of the counter per transition) is almost surely positive for
all pure counterless strategies. Since there are only finitely many pure counterless
strategies, there is even a fixed bound $x > 0$ such that the mean payoff is larger
than $x$ almost surely. This means that after performing $i$ transitions, the counter
should increase by at least $i \cdot x$ on average. Hence, one might be tempted to define

$$m^{(i)} := \begin{cases} C^{(i)} - i \cdot x & \text{if } C^{(j)} > 0 \text{ for all } j,\ 0 \leq j < i, \\ m^{(i-1)} & \text{otherwise,} \end{cases}$$

and try to prove that $m^{(0)}, m^{(1)}, \ldots$ is a submartingale. Unfortunately, this does
not work, because some control states may not allow the counter to increase by $x$
or more. A similar problem was encountered previously in [6] in the context of
purely probabilistic OC automata, and the difficulty was overcome by employing
"artificial" additive constants that compensate for the differences among the individual
control states. We show that a similar trick also works in our setting. That is, we
aim at designing a submartingale of the following form:

$$m^{(i)} := \begin{cases} C^{(i)} + z_{\mathit{State}^{(i)}} - i \cdot x & \text{if } C^{(j)} > 0 \text{ for all } j,\ 0 \leq j < i, \\ m^{(i-1)} & \text{otherwise.} \end{cases}$$

Here $z_q$ is a suitable constant that depends only on $q$. However, it is not clear
whether the constants $z_q$ can be chosen so that $m^{(0)}, m^{(1)}, \ldots$ becomes a submartingale,
and how large these constants are if they exist. This problem is solved
simply by observing that the defining property of a submartingale (see above)
immediately gives a system of linear inequality constraints that should be satisfied
by $z_q$. For example, suppose that $C^{(i)} = j$ and $\mathit{State}^{(i)} = q$ where $q \in Q_\top$. For
every Max strategy, we would like to have $\mathbb{E}[m^{(i+1)} \mid m^{(i)}] \geq m^{(i)}$. This means
that the inequality must be satisfied for every outgoing transition of $(q, j)$, i.e.,
for every $(q, k, r) \in \delta$ we wish to have

$$\mathbb{E}[m^{(i+1)} \mid m^{(i)}] = (j + k) + z_r - (i + 1) \cdot x \geq m^{(i)} = j + z_q - i \cdot x.$$

This yields $z_q \leq -x + k + z_r$. Note that if $q$ is stochastic, we need to consider the
"weighted sum" over the outgoing transitions of $(q, j)$ instead. Thus, we obtain the
system of linear inequalities of Figure 1.

Now we show that the linear system of inequalities given in Figure 1 has a
non-negative rational solution, and derive a bound on its size. Then, we take
this solution, define the associated submartingale, and use Azuma's inequality to
derive the desired result.

Lemma 3.6. Let $\mathcal{A} = (Q, \delta, P)$ be a rising OC-MDP. Then there is a non-negative
rational solution $(x, (z_q)_{q \in Q}) \in \mathbb{Q}^{|Q|+1}$ to $\mathcal{L}$ such that $x > 0$. (The binary encoding
size of the solution is polynomial in $\|\mathcal{A}\|$.)

$$z_q \leq -x + k + z_r \quad \text{for all } q \in Q_\top \text{ and } (q, k, r) \in \delta,$$

$$z_q \leq -x + \sum_{(q,k,r) \in \delta} P((q, k, r)) \cdot (k + z_r) \quad \text{for all } q \in Q_P,$$

$$x > 0.$$

Figure 1: The system $\mathcal{L}$ of linear inequalities over $x$ and $z_q$, $q \in Q$.

Proof. We first prove that there is some non-negative solution to $\mathcal{L}$ with $x > 0$.
The bound on size then follows by standard facts about linear programming. To
find a solution, we will use optimal values for minimizing discounted total reward
in $\mathcal{A}$. For every discount factor, $\lambda$, $0 < \lambda < 1$, and a strategy, $\tau$, we denote the
discounted total reward, starting in $q$ under $\tau$, by
$\mathit{DTR}^\lambda_q(\tau) := \sum_{i \geq 0} \lambda^i \cdot \mathbb{E}^\tau_q[C^{(i+1)} - C^{(i)}]$,
and set $\mathit{DTR}^\lambda_q(*) := \inf_\tau \mathit{DTR}^\lambda_q(\tau)$. We prove that there is some $\lambda$ such that setting
$z_q := \mathit{DTR}^\lambda_q(*)$ and

$$x := \min\Big( \{ k + \mathit{DTR}^\lambda_r(*) - \mathit{DTR}^\lambda_q(*) \mid q \in Q_\top,\ (q, k, r) \in \delta \}$$
$$\cup\ \Big\{ \sum_{(q,k,r) \in \delta} P((q, k, r)) \cdot (k + \mathit{DTR}^\lambda_r(*) - \mathit{DTR}^\lambda_q(*)) \;\Big|\; q \in Q_P \Big\} \Big)$$

forms a non-negative solution to $\mathcal{L}$ with $x > 0$.

Now we proceed in more detail. First we choose the right $\lambda$. Lemma 3.5
and our assumptions guarantee that $\mathbb{P}^\tau_q(\liminf_{i\to\infty} C^{(i)}/i > 0) = 1$ for every pure
counterless strategy $\tau$. Thus $\sum_{i \geq 0} \mathbb{E}^\tau_q[C^{(i+1)} - C^{(i)}] = \infty$, and hence for every such
$\tau$ there is a $\lambda_\tau < 1$ such that $\mathit{DTR}^\lambda_q(\tau) > 0$ for all $q \in Q$ and $\lambda \geq \lambda_\tau$. There
are only finitely many pure counterless strategies, and we choose our $\lambda$ to be
$\lambda := \max_\tau \lambda_\tau < 1$.

Having fixed the $\lambda$ above, we now prove that there is a pure counterless
strategy $\sigma$ such that $\mathit{DTR}^\lambda_q(*) = \mathit{DTR}^\lambda_q(\sigma)$ for all $q$. By standard results (e.g., H),
translated from the terminology of MDPs with rewards to that of OC-MDPs,
for a fixed state, $q$, there is always a pure counterless strategy $\sigma_q$ such that
$\mathit{DTR}^\lambda_q(\sigma_q) = \mathit{DTR}^\lambda_q(*)$. Moreover, this strategy has to play optimally in successors
of $q$ as well, thus there is in fact a single pure counterless strategy $\sigma$ such that
for all $q$: $\mathit{DTR}^\lambda_q(\sigma) = \mathit{DTR}^\lambda_q(*)$.
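Finally, $x > 0$, because for all $q \in Q_P$

$$\mathit{DTR}^\lambda_q(\sigma) = \sum_{i \geq 0} \lambda^i \cdot \mathbb{E}^\sigma_q[C^{(i+1)} - C^{(i)}]$$
$$= \sum_{(q,k,r) \in \delta} P((q, k, r)) \cdot \Big( k + \lambda \cdot \sum_{i \geq 0} \lambda^i \cdot \mathbb{E}^\sigma_r[C^{(i+1)} - C^{(i)}] \Big)$$
$$= \sum_{(q,k,r) \in \delta} P((q, k, r)) \cdot (k + \lambda \cdot \mathit{DTR}^\lambda_r(\sigma))$$
$$< \sum_{(q,k,r) \in \delta} P((q, k, r)) \cdot (k + \mathit{DTR}^\lambda_r(\sigma)),$$

the last inequality following from $\mathit{DTR}^\lambda_r(\sigma) > 0$ for all $r \in Q$; and similarly for all
$q \in Q_\top$ and $(q, k, r) \in \delta$

$$\mathit{DTR}^\lambda_q(\sigma) = \sum_{i \geq 0} \lambda^i \cdot \mathbb{E}^\sigma_q[C^{(i+1)} - C^{(i)}]$$
$$\leq k + \lambda \cdot \sum_{i \geq 0} \lambda^i \cdot \mathbb{E}^\sigma_r[C^{(i+1)} - C^{(i)}] = k + \lambda \cdot \mathit{DTR}^\lambda_r(\sigma) < k + \mathit{DTR}^\lambda_r(\sigma).$$

Here the first inequality follows from the fact that $\sigma$ minimizes the discounted
total reward. □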
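The construction in the proof of Lemma 3.6 can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the two-state OC-MDP, the discount factor $\lambda = 0.9$, and all identifiers are hypothetical and not taken from the paper, and the example is merely assumed to be rising (every pure counterless strategy has positive mean payoff and none returns to a state with an unchanged counter). It computes $z_q := \mathit{DTR}^\lambda_q(*)$ by value iteration, takes $x$ as the smallest slack, and checks the inequalities of Figure 1.

```python
# Hypothetical two-state OC-MDP (not from the paper):
#   'a' belongs to Max and may use rule (a, 0, b) or (a, +1, b);
#   'b' is stochastic with rules (b, +1, a) w.p. 3/4 and (b, -1, a) w.p. 1/4.
MAX_RULES = {"a": [(0, "b"), (1, "b")]}
STOCH_RULES = {"b": [(1, "a", 0.75), (-1, "a", 0.25)]}

def min_discounted_reward(lam: float, sweeps: int = 5000) -> dict:
    """Value iteration for DTR^lam_q(*), the minimal discounted counter change."""
    v = {q: 0.0 for q in list(MAX_RULES) + list(STOCH_RULES)}
    for _ in range(sweeps):
        new = {}
        for q, rules in MAX_RULES.items():
            new[q] = min(k + lam * v[r] for k, r in rules)
        for q, rules in STOCH_RULES.items():
            new[q] = sum(p * (k + lam * v[r]) for k, r, p in rules)
        v = new
    return v

def solution(lam: float = 0.9):
    """Return (x, z) built as in the proof: z_q = DTR_q(*), x = smallest slack."""
    z = min_discounted_reward(lam)
    slacks = [k + z[r] - z[q] for q, rules in MAX_RULES.items() for k, r in rules]
    slacks += [sum(p * (k + z[r]) for k, r, p in rules) - z[q]
               for q, rules in STOCH_RULES.items()]
    return min(slacks), z

def satisfies_system(x: float, z: dict, tol: float = 1e-9) -> bool:
    """Check the inequalities of Figure 1 for the example (up to rounding)."""
    ok_max = all(z[q] <= -x + k + z[r] + tol
                 for q, rules in MAX_RULES.items() for k, r in rules)
    ok_stoch = all(z[q] <= -x + sum(p * (k + z[r]) for k, r, p in rules) + tol
                   for q, rules in STOCH_RULES.items())
    return ok_max and ok_stoch and x > 0
```

Calling `solution()` and passing the result to `satisfies_system` exercises the whole construction. In this example the binding constraint is the one for the stochastic state, which mirrors the role of the strict inequality obtained from $\lambda < 1$ and $\mathit{DTR}^\lambda_r(\sigma) > 0$.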

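Given the solution $(x, (z_q)_{q \in Q}) \in \mathbb{Q}^{|Q|+1}$ from Lemma 3.6, we define a sequence
of random variables $\{m^{(i)}\}_{i \geq 0}$ by setting

$$m^{(i)} := \begin{cases} C^{(i)} + z_{\mathit{State}^{(i)}} - i \cdot x & \text{if } C^{(j)} > 0 \text{ for all } j,\ 0 \leq j < i, \\ m^{(i-1)} & \text{otherwise.} \end{cases}$$

We shall now show that $m^{(i)}$ defines a submartingale.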

Lemma 3.7. Let $\mathcal{A} = (Q, \delta, P)$ be a rising OC-MDP and $\{m^{(i)}\}_{i \geq 0}$ be defined as
above. Under an arbitrary strategy $\tau$ and with an arbitrary initial configuration
$(q, n)$, the process $\{m^{(i)}\}_{i \geq 0}$ is a submartingale.

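Proof. Consider a fixed path, $u$, of length $i \geq 0$. For all $j$, $0 \leq j \leq i$, the values
$C^{(j)}(\omega)$ are the same for all $\omega \in \mathit{Run}(u)$. We denote these common values by
$C^{(j)}(u)$, and similarly for $\mathit{State}^{(j)}(u)$ and $m^{(j)}(u)$. If $C^{(j)}(u) = 0$ for some $j \leq i$,
then $m^{(i+1)}(\omega) = m^{(i)}(\omega)$ for every $\omega \in \mathit{Run}(u)$. Thus
$\mathbb{E}^\tau_{(q,n)}[m^{(i+1)} \mid \mathit{Run}(u)] = m^{(i)}(u)$. Otherwise, consider the last configuration, $(r, \ell)$, of $u$.
For every possible successor, $(r', \ell')$, set

$$p_{(r',\ell')} := \begin{cases} \tau(u)((r, \ell) \to (r', \ell')) & \text{if } r \in Q_\top, \\ \mathit{Prob}((r, \ell) \to (r', \ell')) & \text{if } r \in Q_P. \end{cases}$$

Then

$$\mathbb{E}^\tau_{(q,n)}[C^{(i+1)} - C^{(i)} + z_{\mathit{State}^{(i+1)}} - x \mid \mathit{Run}(u)] = -x + \sum_{(r,k,r') \in \delta} p_{(r',\ell+k)} \cdot (k + z_{r'}) \geq z_r.$$

This allows us to derive the following:

$$\mathbb{E}^\tau_{(q,n)}[m^{(i+1)} \mid \mathit{Run}(u)] = \mathbb{E}^\tau_{(q,n)}[C^{(i+1)} + z_{\mathit{State}^{(i+1)}} - (i + 1) \cdot x \mid \mathit{Run}(u)]$$
$$= C^{(i)}(u) + \mathbb{E}^\tau_{(q,n)}[C^{(i+1)} - C^{(i)} + z_{\mathit{State}^{(i+1)}} - x \mid \mathit{Run}(u)] - i \cdot x$$
$$\geq C^{(i)}(u) + z_{\mathit{State}^{(i)}(u)} - i \cdot x = m^{(i)}(u).$$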



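Now we have prepared all that we need to bound $\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term})$ for rising
OC-MDPs.

Lemma 3.8. Given a rising OC-MDP, $\mathcal{A}$, one can compute a rational constant
$c < 1$, and an integer $h \geq 0$ such that for all $i \geq h$ and $q \in Q$

$$\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term}) \leq \frac{c^i}{1 - c}.$$

Moreover, $c \in \exp(1/2^{\|\mathcal{A}\|^{O(1)}})$ and $h \in \exp(\|\mathcal{A}\|^{O(1)})$.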

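Proof. Denote by $\mathit{Term}_j$ the event of terminating after exactly $j$ steps. Further set
$z_{\max} := \max_{q \in Q} z_q - \min_{q \in Q} z_q$, and assume that $C^{(0)} \geq z_{\max}$. Then the event $\mathit{Term}_j$
implies that $m^{(j)} - m^{(0)} = z_{\mathit{State}^{(j)}} - j \cdot x - C^{(0)} - z_{\mathit{State}^{(0)}} \leq -j \cdot x$. Finally, observe
that we can bound the one-step change of the submartingale value by $z_{\max} + x + 1$.
Using the Azuma-Hoeffding inequality for the submartingale $\{m^{(n)}\}_{n \geq 0}$ (see, e.g.,
Theorem 12.2.3 in [13]), we thus obtain the following bound for every strategy $\sigma$
and initial configuration $(q, i)$ with $i \geq z_{\max}$:

$$\mathbb{P}^\sigma_{(q,i)}(\mathit{Term}_j) \leq \mathbb{P}^\sigma_{(q,i)}(m^{(j)} - m^{(0)} \leq -j \cdot x) \leq \exp\left(\frac{-x^2 \cdot j}{2 \cdot (z_{\max} + x + 1)^2}\right).$$

We choose $c := \exp\left(\frac{-x^2}{2 \cdot (z_{\max} + x + 1)^2}\right) < 1$ and $h := \lceil z_{\max} \rceil$, and observe that for all
$q \in Q$, $i \geq h$:

$$\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term}) \leq \sum_{j \geq i} \sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term}_j) \leq \sum_{j \geq i} c^j = \frac{c^i}{1 - c}.$$

The given bounds on $c$ and $h$ are easy to check, and the detailed computation can
be found in Section A.3. □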

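As a final step, we extend the results to the general case of (not necessarily
rising) OC-MDPs.

Lemma 3.9. Given a maximizing OC-MDP, $\mathcal{A}' = (Q', \delta', P')$, one can compute a
rational constant $c < 1$, and an integer $h \geq 0$ such that for all $i \geq h$ and $q \in Q'$

$$\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T) \leq \frac{c^i}{1 - c}.$$

Moreover, $c \in \exp(1/2^{\|\mathcal{A}'\|^{O(1)}})$ and $h \in \exp(\|\mathcal{A}'\|^{O(1)})$. As a consequence, a number
$N$ such that $|\mathrm{Val}(\mathit{Term}, (q, i)) - \sup_\sigma \mathbb{P}^\sigma_q(\mathit{LimInf}(=-\infty))| \leq \varepsilon$ for all $q \in Q'$ and
$i \geq N$ satisfies $N \leq \max\{h, \lceil \log_c(\varepsilon \cdot (1 - c)) \rceil\} \in \exp(\|\mathcal{A}'\|^{O(1)}) \cdot O(\log(1/\varepsilon))$.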

Proof. The heart of the proof is a reduction which computes a polynomially bigger
rising OC-MDP $\bar{\mathcal{A}} = (\bar{Q}, \bar{\delta}, \bar{P})$ from $\mathcal{A}'$, uses the algorithm from Lemma 3.8
to compute the bounds $c$ and $h$ for $\bar{\mathcal{A}}$, and returns the very same numbers for $\mathcal{A}'$.
The reduction itself is in two steps, first computing an OC-MDP $\mathcal{A} = (Q, \delta, P)$
from $\mathcal{A}'$ such that $\mathrm{Val}(\mathit{LimInf}(=-\infty), q) < 1$ for all $q \in Q$ in $\mathcal{A}$, and then
computing $\bar{\mathcal{A}}$ from $\mathcal{A}$.

The first step, from $\mathcal{A}'$ to $\mathcal{A}$, is easier. Recall that we called $T = T(Q)$
the set of all $q \in Q$ such that $\mathrm{Val}(\mathit{LimInf}(=-\infty), q) = 1$. Here we use it
also to denote the analogous subset $T(Q')$ of $Q'$ of all states $q \in Q'$ such that
$\mathrm{Val}(\mathit{LimInf}(=-\infty), q) = 1$ in $\mathcal{A}'$. Theorem 3.1 from [3] guarantees that we
can compute the set $T(Q')$ in time polynomial in $\|\mathcal{A}'\|$. Then we set
$Q := (Q' \setminus T(Q')) \cup \{\mathit{trap}\}$, with $Q_P = Q'_P \setminus T(Q')$. The state $\mathit{trap}$ has a unique
outgoing rule in $\delta$: $(\mathit{trap}, +1, \mathit{trap})$. The rest of the rules in $\delta$ are derived from $\delta'$
by redirecting all rules ending in $T(Q')$ to $\mathit{trap}$. $P$ is derived from $P'$ accordingly.
It is easy to see that $T(Q) = \emptyset$, because $T(Q') \cap Q = \emptyset$ and by construction,
$T(Q) \subseteq T(Q')$.

A technique to achieve the second step was already partially developed in
our previous work where we used the term "decreasing" for rising strategies.
There we gave a construction which preserves the property of the optimal termination
probability being $= 1$. We can in fact establish that a similar construction
preserves the exact termination value. Because the idea is not new, we leave details
to Section A.2. The important properties of $\bar{\mathcal{A}}$ are stated in the following lemma,
the proof of which can be found in Section A.2, along with the formal definition
of $\bar{\mathcal{A}}$.

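Lemma 3.10. There is a deterministic polynomial-time algorithm which, given a
maximizing OC-MDP, $\mathcal{A} = (Q, \delta, P)$, computes another maximizing OC-MDP,
$\bar{\mathcal{A}} = (\bar{Q}, \bar{\delta}, \bar{P})$, and a map $f : Q \to \bar{Q}$ satisfying:

• $\|\bar{\mathcal{A}}\| \in O(\|\mathcal{A}\|^4)$.

• There are no idling pure counterless strategies in $\bar{\mathcal{A}}$.

• $\mathrm{Val}(\mathit{Term}, (q, i)) = \mathrm{Val}(\mathit{Term}, (f(q), i))$ for all $q \in Q$ and $i \geq 0$.

• If $\mathrm{Val}(\mathit{LimInf}(=-\infty), q) < 1$ for all $q \in Q$ in $\mathcal{A}$, then $\bar{\mathcal{A}}$ is rising.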

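In particular, note that $\bar{\mathcal{A}}$ obtained from our $\mathcal{A}$ is rising. Now let $q \in Q'$ be a
state of $\mathcal{A}'$ such that $q \notin T(Q')$. We know that $\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term} \cap \neg\mathit{Reach}_T)$ in $\mathcal{A}'$
equals $\sup_\sigma \mathbb{P}^\sigma_{(q,i)}(\mathit{Term})$ in $\mathcal{A}$, which in turn equals $\sup_\sigma \mathbb{P}^\sigma_{(f(q),i)}(\mathit{Term})$ in $\bar{\mathcal{A}}$. Note
that $\|\bar{\mathcal{A}}\| \in \|\mathcal{A}'\|^{O(1)}$. Applying Lemma 3.8 to $\bar{\mathcal{A}}$ finishes the proof of the first part
of Lemma 3.9.

In the second part the inequality $N \leq \max\{h, \lceil \log_c(\varepsilon \cdot (1 - c)) \rceil\}$ is an easy
computation. Verifying that $\lceil \log_c(\varepsilon \cdot (1 - c)) \rceil \in \exp(\|\mathcal{A}'\|^{O(1)})$ is also easy and can
be found in Section A.3. □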
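To make the role of the constants concrete, here is a minimal sketch of how $c$, $h$ and the threshold $N$ could be evaluated numerically once a solution $(x, (z_q)_{q \in Q})$ is known. It assumes the choices $c = \exp(-x^2 / (2 (z_{\max} + x + 1)^2))$ and $h = \lceil z_{\max} \rceil$ from the proof of Lemma 3.8; the function names and the sample inputs are hypothetical, and floating-point arithmetic stands in for the exact rational arithmetic a real implementation would need.

```python
from math import ceil, exp, log

def tail_constants(x: float, z_max: float):
    """c and h as chosen in the proof of Lemma 3.8 (floating-point sketch)."""
    c = exp(-x * x / (2 * (z_max + x + 1) ** 2))
    h = ceil(z_max)
    return c, h

def counter_threshold(c: float, h: int, eps: float) -> int:
    """N = max(h, ceil(log_c(eps * (1 - c)))), as in Lemma 3.9."""
    return max(h, ceil(log(eps * (1 - c), c)))
```

For every $i \geq N$ the tail bound $c^i / (1 - c)$ is then at most $\varepsilon$, which is exactly what the inequality for $N$ expresses. Note how quickly $N$ grows when $x$ is small relative to $z_{\max}$, in line with the exponential upper bound on $N$.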

Also, as an immediate consequence of Lemma 3.3 and Lemma 3.9, we obtain
the following:

Corollary 3.11. For every $q \in Q$, $\nu_q = \lim_{i\to\infty} \mathrm{Val}(\mathit{Term}, (q, i))$.

3.2. Bounding N for general SSGs 

By [0, Proposition 7], player Min always has an optimal pure counterless strategy,
$\pi^*$, such that

$$\mathrm{Val}(\mathit{LimInf}(=-\infty), q) = \sup_\sigma \mathbb{P}^{\sigma,\pi^*}_q(\mathit{LimInf}(=-\infty)).$$

By fixing the choices of $\pi^*$ in $\mathcal{A}$ we obtain a maximizing OC-MDP,
$\mathcal{A}^* = (Q^*, \delta^*, P^*)$, where $Q^*_P = Q_P \cup Q_\bot$, $Q^*_\top = Q_\top$,
$\delta^* := \{(q, k, r) \in \delta \mid q \in Q_P \cup Q_\top \vee \pi^*(q) = r\}$, and $P^*$ is the unique
(for $\pi^*$) extension of $P$ to states from $Q_\bot$.
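Lemma 3.12. Let $\mathcal{A} = (Q, \delta, P)$ be an OC-SSG, $\pi^*$ a $\mathit{LimInf}(=-\infty)$-optimal
strategy for Min, and $\mathcal{A}^*$ the maximizing OC-MDP given by fixing $\pi^*$ in $\mathcal{A}$ as
described above. Then for all $q \in Q$:

$$\forall i \geq 0 : \lim_{j\to\infty} \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, j)) \leq \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, i)) \leq \mathrm{Val}_{\mathcal{A}^*}(\mathit{Term}, (q, i)). \qquad (4)$$
$$\lim_{j\to\infty} \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, j)) = \lim_{j\to\infty} \mathrm{Val}_{\mathcal{A}^*}(\mathit{Term}, (q, j)). \qquad (5)$$

Proof. Since for all $j$ we have $\mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, j + 1)) \leq \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, j))$, we
obtain the first inequality in (4). The second inequality in (4) follows from the
fact that in $\mathcal{A}^*$ we restricted the possible moves of Min. The "$\leq$" direction in (5)
follows directly from (4), and the other direction is obtained as follows:

$$\lim_{i\to\infty} \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, i)) \geq \mathrm{Val}_{\mathcal{A}}(\mathit{LimInf}(=-\infty), q) \quad \text{(immediate)}$$
$$= \mathrm{Val}_{\mathcal{A}^*}(\mathit{LimInf}(=-\infty), q) \quad \text{(immediate)}$$
$$= \lim_{i\to\infty} \mathrm{Val}_{\mathcal{A}^*}(\mathit{Term}, (q, i)) \quad \text{(by Corollary 3.11)}$$

□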

Corollary 3.13. For every control state $q$ of an OC-SSG $\mathcal{A}$ we have that
$\lim_{i\to\infty} \mathrm{Val}_{\mathcal{A}}(\mathit{Term}, (q, i)) = \mathrm{Val}_{\mathcal{A}}(\mathit{LimInf}(=-\infty), q)$.

3.3. Analyzing a Finite Segment of Configurations 

Lemma 3.14. There is a nondeterministic algorithm (see footnote 6) that, given an OC-SSG,
$\mathcal{A} = (Q, \delta, P)$, and a rational $\varepsilon > 0$ as input, and, in addition, given the following
precomputed values:

• $\nu_q = \mathrm{Val}(\mathit{LimInf}(=-\infty), q)$ for every $q \in Q$,

• an integer $N \geq 0$ such that $0 \leq \mathrm{Val}(\mathit{Term}, (q, i)) - \nu_q \leq \varepsilon$ for all $q \in Q$ and
$i \geq N$,

• and a pair of strategies $(\sigma^*, \pi^*)$ for Max and Min which are optimal for
$\mathit{LimInf}(=-\infty)$ in all $q \in Q$;

computes the following output:

• a number $v_{(q,i)}$ for each $q \in Q$ and $i \leq N$ such that
$0 \leq \mathrm{Val}(\mathit{Term}, (q, i)) - v_{(q,i)} \leq \varepsilon$,

• and a pair of strategies $(\sigma, \pi)$ for Max and Min, respectively, which are
$\varepsilon$-optimal for termination in all configurations.

The algorithm runs in time polynomial in $N \cdot \|\mathcal{A}\|$. Furthermore, if $\mathcal{A}$ is an
OC-MDP then the algorithm is deterministic.

6 Again, see footnote 4 for a precise explanation of what we mean by a nondeterministic
algorithm in this context.

Proof. The first idea is to analyze the following SSG, $\mathcal{G}$, which is essentially $\mathcal{A}$
restricted to configurations with counter value between $0$ and $N$. The set of states
of $\mathcal{G}$ is $\{(q, i) \mid q \in Q,\ 0 \leq i \leq N\} \cup \{s_0, s_1\}$. The ownership of the states of the
form $(q, i)$, $0 < i < N$, is the same as in $\mathcal{A}$; the states $s_0$, $s_1$ and $(q, i)$ for $q \in Q$,
$i \in \{0, N\}$ are stochastic. For $0 < i < N$, there is a transition $(q, i) \to (r, j)$ iff
$(q, j - i, r) \in \delta$. Probabilities of these transitions, where applicable, are derived
from $P$. Vertices of the form $(q, 0)$, and the state $s_0$, have only one transition, to
$s_0$. Vertices of the form $(q, N)$ have transitions to both $s_0$ and $s_1$, and the state $s_1$
has only the self-loop transition. The probability of a transition $(q, N) \to s_0$ equals
$\mathrm{Val}(\mathit{Term}, (q, N))$ for all $q$.

Clearly we have $\sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0) = \mathrm{Val}(\mathit{Term}, (q, i))$ for all $q \in Q$ and
$i \leq N$. The problem is that the transition probabilities from $(q, N)$ in $\mathcal{G}$ are
unknown (and may even be irrational). We will not actually construct $\mathcal{G}$. To use such
reachability analysis for approximating the termination values we have to switch
to a slightly perturbed SSG, $\mathcal{G}'$.

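$\mathcal{G}'$ is almost identical to $\mathcal{G}$: it has the same sets of states and transitions. The
only difference is that in $\mathcal{G}'$ the probability of a transition $(q, N) \to s_0$ equals $\nu_q$ for
every $q$ (and the probability of $(q, N) \to s_1$ changes appropriately to make the sum
$1$). Observe that since $\nu_q \leq \mathrm{Val}(\mathit{Term}, (q, N))$, for every $(q, i)$ where $i \leq N$:

$$\sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}') \leq \sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}) = \mathrm{Val}(\mathit{Term}, (q, i)).$$

On the other hand, by our assumption on the values $\nu_q$ and $N$,

$$\sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}) - \varepsilon \leq \sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}').$$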

Thus $\mathrm{Val}(\mathit{Term}, (q, i)) - \sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}') \leq \varepsilon$, and we may output
$v_{(q,i)} := \sup_\sigma \inf_\pi \mathbb{P}^{\sigma,\pi}_{(q,i)}(\text{reach } s_0 \text{ in } \mathcal{G}')$. By standard results, see, e.g., flVD , such
reachability values have a binary encoding polynomial in $\|\mathcal{G}'\|$, and after a
memoryless optimal strategy (having size polynomial in $\|\mathcal{G}'\|$) is guessed, the values
can be computed in time polynomial in $\|\mathcal{G}'\|$. If $\mathcal{A}$ is an OC-MDP, then $\mathcal{G}'$ is an
MDP, and for MDPs the reachability values, and optimal strategies, can be
computed in deterministic polynomial time.

Let us suppose we have computed the optimal strategies $\sigma_R$, $\pi_R$ for reachability
in $\mathcal{G}'$. The resulting strategy $\sigma$ for the given OC-SSG $\mathcal{A}$ is defined as follows: In
configurations with counter value between $0$ and $N$ it plays according to the optimal
reachability strategy of Max in $\mathcal{G}'$. Once a configuration with a counter value above $N$
is visited it switches to playing as $\sigma^*$ forever, where $\sigma^*$ is the optimal strategy we
assume we are given for $\mathit{LimInf}(=-\infty)$. Now for all configurations $(q, i)$, $0 < i < N$, and
strategies $\pi$ for Min, the number $v_{(q,i)}$ gives a lower bound on the probability that under
$(\sigma, \pi)$ a run either terminates without exceeding counter value $N$, or hits some $(r, N)$ and
then satisfies $\mathit{LimInf}(=-\infty)$. This probability itself is a lower bound for the probability
that a run either terminates without exceeding counter value $N$, or hits some
$(r, N)$ and then terminates, which is in other words the probability of termination.
This means that $\sigma$ is $\varepsilon$-optimal, because $\mathrm{Val}(\mathit{Term}, (q, i)) - v_{(q,i)} \leq \varepsilon$.

Analogously we define the strategy $\pi$. Consider again some $(q, i)$, $0 < i < N$,
and $\sigma$ for Max. The number $v_{(q,i)}$ now gives an upper bound on the probability
that under $(\sigma, \pi)$ a run either terminates without exceeding counter value $N$, or
hits some $(r, N)$ and then satisfies $\mathit{LimInf}(=-\infty)$. From the properties of $N$, this
probability is at most $\varepsilon$ lower than the probability of termination. Because
$v_{(q,i)} \leq \mathrm{Val}(\mathit{Term}, (q, i))$ we obtain that also $\pi$ is $\varepsilon$-optimal. □
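For the OC-MDP case, the finite-segment analysis of Lemma 3.14 can be sketched as plain value iteration on the perturbed model $\mathcal{G}'$. The code below is a hedged illustration, not the paper's algorithm: the paper relies on polynomial-time exact methods for reachability in MDPs, whereas this sketch uses floating-point value iteration. The instance is the chain of Appendix A.1 with state $t$ removed (where the termination value from $(s, i)$ is $2^{-i}$ and $\nu_s = \nu_r = 0$); the dictionaries and the function name are hypothetical.

```python
# Hypothetical instance: the chain of Appendix A.1 without state t, i.e.
# s --0--> r, and r --(+1, prob 2/3)--> s, r --(-1, prob 1/3)--> s.
MAX_RULES = {"s": [(0, "r")]}
STOCH_RULES = {"r": [(1, "s", 2 / 3), (-1, "s", 1 / 3)]}

def segment_values(nu: dict, n_max: int, sweeps: int = 5000) -> dict:
    """Approximate Val(Term, (q, i)) for 0 <= i <= n_max by value iteration.

    Configurations with counter 0 count as terminated (value 1); at counter
    n_max the unknown value is replaced by nu[q], as in the perturbed game.
    """
    states = list(MAX_RULES) + list(STOCH_RULES)
    v = {(q, i): 0.0 for q in states for i in range(n_max + 1)}
    for q in states:
        v[(q, 0)] = 1.0
        v[(q, n_max)] = nu[q]
    for _ in range(sweeps):
        for i in range(1, n_max):
            for q, rules in MAX_RULES.items():
                v[(q, i)] = max(v[(r, i + k)] for k, r in rules)
            for q, rules in STOCH_RULES.items():
                v[(q, i)] = sum(p * v[(r, i + k)] for k, r, p in rules)
    return v
```

With `nu = {"s": 0.0, "r": 0.0}` and `n_max = 10`, the computed values underestimate the true values $2^{-i}$ by at most $2^{-10}$, matching the guarantee $0 \leq \mathrm{Val}(\mathit{Term}, (q, i)) - v_{(q,i)} \leq \varepsilon$ for $\varepsilon = 2^{-10}$.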

4. Conclusions 

We have shown that one can $\varepsilon$-approximate the termination value for OC-MDP
(and for OC-SSG) termination games, and compute $\varepsilon$-optimal strategies for them,
in exponential time (and in nondeterministic exponential time, respectively).

Our results leave open several intriguing problems. An obvious remaining 
open problem is to obtain better complexity bounds. In particular, we know of 
no non-trivial lower bounds for OC-MDP approximation problems, and it re- 
mains possible that approximation of the value for OC-MDPs can be computed 
in polynomial time. Our results also leave open the decidability of the quantitative
termination decision problem for OC-MDPs and OC-SSGs, which asks: "is
the termination value $\geq p$?" for a given rational probability $p$. Furthermore, our
results leave open the computability of approximating the value of selective
termination objectives for OC-MDPs, where the goal is to terminate (reach counter
value 0) in a specific subset of the control states. Qualitative versions of selective 
termination problems were studied in [j^L 13]- 
Figure 2: An OC-MDP where Player Max does not have optimal strategies for termination. Signed
numbers represent counter increments, unsigned are probabilities of transitions.

A. Appendix 

A.1. Non-existence of Optimal Strategies for Termination

In the following example we show that even in the special case of OC-MDPs
there may not be any optimal strategies for maximizing the termination values.
More precisely, there is a maximizing OC-MDP, $\mathcal{A}$, and (infinitely many)
configurations $(s, i)$ such that for all strategies $\sigma$: $\mathbb{P}^\sigma_{(s,i)}(\mathit{Term}) < \mathrm{Val}(\mathit{Term}, (s, i))$.

Example A.1. Consider the maximizing OC-MDP, $\mathcal{A}$, given in Figure 2. In the
graph, round nodes represent stochastic states, the only square node is a state of
Player Max, $s$. The arrows represent the rules, with signed numbers representing
the increments, and unsigned numbers the probabilities. For example, the arrow from $s$
to $r$ represents the rule $(s, 0, r)$, whereas the right arrow from $r$ to $s$ represents the
rule $(r, +1, s)$, which has probability $P(r, 1, s) = 2/3$.

Claim 1. If the rule $(s, 0, t)$ is removed then $\mathrm{Val}(\mathit{Term}, (s, i)) = 2^{-i}$.

Proof. Observe that there is only one strategy when the rule above is removed. We
will omit writing its name. Clearly $\mathbb{P}_{(s,0)}(\mathit{Term}) = 1 = 2^0$. Further, the assignment
$x := \mathbb{P}_{(s,1)}(\mathit{Term})$ is the least non-negative solution of the equation $x = \frac{1}{3} + \frac{2}{3} x^2$,
which is $\frac{1}{2}$. Finally, $\mathbb{P}_{(s,i)}(\mathit{Term}) = \frac{1}{3} \cdot \mathbb{P}_{(s,i-1)}(\mathit{Term}) + \frac{2}{3} \cdot \mathbb{P}_{(s,i+1)}(\mathit{Term})$. Given the
initial conditions for $i = 0, 1$, we obtain $\mathbb{P}_{(s,i)}(\mathit{Term}) = 2^{-i}$ as a unique solution of
this recurrence. □

Claim 2. $\mathrm{Val}(\mathit{Term}, (s, 1)) = \frac{3}{4}$.

Proof. First we prove that $\mathrm{Val}(\mathit{Term}, (s, 1)) \geq \frac{3}{4}$. For any $n$ consider the pure
strategy, $\sigma_n$, given for all histories $u$ ending in $(s, i)$ by $\sigma_n(u)((s, i) \to (r, i)) = 1$ if
$i < n$ and $\sigma_n(u)((s, i) \to (t, i)) = 1$ if $i \geq n$. Set

$$p_i := \mathbb{P}^{\sigma_n}_{(s,1)}(\text{reach } (s, i)).$$

Observe that $p_i$ stays the same number if we define it using any $\sigma_n$ with $n \geq i$,
and that $1 - p_i = \mathbb{P}^{\sigma_n}_{(s,1)}(\text{terminate before reaching } (s, i))$. Moreover, $p_1 = 1$ and
$p_{i+1} = \frac{2}{3} \cdot (p_i + (1 - p_i) \cdot p_{i+1})$. This uniquely determines that $p_i = \frac{2^{i-1}}{2^i - 1}$. Note that
$\lim_{i\to\infty} p_i = \frac{1}{2}$. Finally, observe that

$$\mathbb{P}^{\sigma_n}_{(s,1)}(\mathit{Term}) = (1 - p_n) + p_n \cdot \frac{1}{2}.$$

Thus, as $n \to \infty$ the probability of termination under $\sigma_n$ approaches $\frac{3}{4}$.

Now we prove that $\mathrm{Val}(\mathit{Term}, (s, 1)) \leq \frac{3}{4}$ by proving that $\mathbb{P}^\sigma_{(s,1)}(\text{terminate}) \leq \frac{3}{4}$
for every $\sigma$. Consider the following probabilities:

$$p_a := \mathbb{P}^\sigma_{(s,1)}(\text{terminate without visiting } t),$$
$$p_b := \mathbb{P}^\sigma_{(s,1)}(\text{terminate after visiting } t),$$
$$p_c := \mathbb{P}^\sigma_{(s,1)}(\text{visit } t).$$

Clearly $p_b = \frac{p_c}{2}$. Due to the first Claim, applied to $i = 1$, we also have that $p_a \leq \frac{1}{2}$.
Finally, $p_a + p_c \leq 1$ since the events are disjoint. We conclude that

$$\mathbb{P}^\sigma_{(s,1)}(\mathit{Term}) = p_a + p_b \leq p_a + \frac{1}{2} \cdot (1 - p_a) = \frac{1}{2} \cdot p_a + \frac{1}{2} \leq \frac{3}{4}.$$

□

Claim 3. For all $i \geq 0$, $\mathrm{Val}(\mathit{Term}, (s, i)) = \frac{2^i + 1}{2^{i+1}}$.

Proof. The case $i = 0$ is trivial, and $i = 1$ is by the previous Claim. Observe that
$\mathrm{Val}(\mathit{Term}, (s, i)) \geq \frac{1}{2}$ for all $i$, because there is always the transition to $(t, i)$ from
where the system terminates with probability $\frac{1}{2}$. Consequently, $\mathrm{Val}(\mathit{Term}, (r, i)) \geq \frac{1}{2}$
for all $i$ as well.

Thus, for a fixed $i$, either $\mathrm{Val}(\mathit{Term}, (s, i)) = \frac{1}{2}$ or $\mathrm{Val}(\mathit{Term}, (s, i)) > \frac{1}{2}$.
In the first case, taking the transition $(s, i) \to (r, i)$ is still value-optimal, i.e.,
$\mathrm{Val}(\mathit{Term}, (s, i)) = \frac{1}{2} \leq \mathrm{Val}(\mathit{Term}, (r, i))$. In the second case the transition
$(s, i) \to (t, i)$ is not value-optimal, and thus the transition $(s, i) \to (r, i)$ has to be
value-optimal.

Thus we know that $(s, i) \to (r, i)$ always preserves the termination value, and
we may unfold two steps of the Bellman-style equations satisfied by the value to
obtain

$$\mathrm{Val}(\mathit{Term}, (s, i)) = \frac{1}{3} \cdot \mathrm{Val}(\mathit{Term}, (s, i - 1)) + \frac{2}{3} \cdot \mathrm{Val}(\mathit{Term}, (s, i + 1)).$$

Given the initial conditions for $i = 0, 1$, we obtain $\mathrm{Val}(\mathit{Term}, (s, i)) = \frac{2^i + 1}{2^{i+1}}$ as a
unique solution of this recurrence. □

Thus for all $n \geq 1$ we have $\mathrm{Val}(\mathit{Term}, (s, n)) = 2^{-(n+1)} \cdot (2^n + 1)$, and
also, obviously, $\mathrm{Val}(\mathit{Term}, (t, n)) = 1/2$. As a conclusion, $\mathrm{Val}(\mathit{Term}, (s, n)) >
\mathrm{Val}(\mathit{Term}, (t, n))$. Thus no termination-optimal strategy may choose a transition
generated by the rule $(s, 0, t)$. On the other hand, as shown in the first Claim,
without the rule $(s, 0, t)$ we would have $\mathrm{Val}(\mathit{Term}, (s, n)) = 2^{-n} < 2^{-(n+1)} \cdot (2^n + 1)$.
Consequently, there are no termination-optimal strategies in $(s, n)$.
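The closed forms used in the three Claims can be checked with exact rational arithmetic. The following sketch assumes that the rule $(r, -1, s)$ has probability $1/3$ (the text only states $P(r, 1, s) = 2/3$, so this is inferred) and that termination from $t$ has probability $1/2$, as used in the proofs; the function names are hypothetical.

```python
from fractions import Fraction as F

UP, DOWN = F(2, 3), F(1, 3)  # P(r,+1,s) from the text; P(r,-1,s) is inferred

def claim1_value(i: int) -> F:
    """Termination probability 2^-i from (s, i) when rule (s, 0, t) is removed."""
    return F(1, 2**i)

def reach_prob(i: int) -> F:
    """p_i = 2^(i-1) / (2^i - 1): probability of reaching (s, i) from (s, 1)."""
    return F(2**(i - 1), 2**i - 1)

def sigma_n_termination(n: int) -> F:
    """Termination probability from (s, 1) under sigma_n: (1 - p_n) + p_n / 2."""
    p = reach_prob(n)
    return (1 - p) + p / 2

def claim3_value(i: int) -> F:
    """Closed form (2^i + 1) / 2^(i+1) for Val(Term, (s, i))."""
    return F(2**i + 1, 2**(i + 1))

def satisfies_recurrence(f, i: int) -> bool:
    """Check f(i) = 1/3 * f(i-1) + 2/3 * f(i+1)."""
    return f(i) == DOWN * f(i - 1) + UP * f(i + 1)
```

Evaluating `sigma_n_termination(n)` for growing `n` shows the point of the example: the values increase towards $3/4$ but never reach it, so the supremum is not attained by any of the strategies $\sigma_n$.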

A.2. Reduction to Rising OC-MDPs 

Recall from Definition 3.4 that a pure counterless strategy, $\sigma$, is called idling
if there is a state $q \in Q$ such that $\mathbb{P}^\sigma_{(q,0)}(\exists i > 0 : \mathit{State}^{(i)} = q) = 1$ and for all $i > 0$:
$\mathbb{P}^\sigma_{(q,0)}(\mathit{State}^{(i)} = q \Rightarrow C^{(i)} = 0) = 1$. Also recall that a maximizing OC-MDP $\mathcal{A}$ is
rising if there is no idling strategy for $\mathcal{A}$ and, moreover, $\mathrm{Val}(\mathit{LimInf}(=-\infty), q) < 1$
for all states $q$ of $\mathcal{A}$. Before we start proving Lemma 3.10, let us prove an auxiliary
result.

Lemma A.2. Let $w$ be a finite path of length $n$ such that for all $\omega \in \mathit{Run}(w)$:

• $C^{(i)}(\omega) \geq C^{(0)}(\omega)$ for all $i \leq n$, and

• if $0 \leq t < t' \leq n$ and $\mathit{State}^{(t)}(\omega) = \mathit{State}^{(t')}(\omega)$ then $C^{(t)}(\omega) > C^{(t')}(\omega)$.

Then $n \leq |Q|^2$ and $\max_{0 \leq i \leq n} C^{(i)}(\omega) - C^{(0)}(\omega) \leq |Q|$ for all $\omega \in \mathit{Run}(w)$.

Proof. From the fact that the maximal positive counter change is $+1$ and the second
property of $w$, we have that $C^{(i)}(\omega) - C^{(0)}(\omega) \leq |Q|$ for all $i \leq n$. Again by
the second property, every control state is thus visited at most $|Q|$ times before the
counter drops below $C^{(0)}$. By the first property we now have $n \leq |Q|^2$. □

Proof of Lemma 3.10. We first intuitively explain the idea for the construction of
$\bar{\mathcal{A}}$: Using some added information in the control states, the OC-MDP will offer
the following possibilities as long as a counterless strategy is chosen: either the
chosen counterless strategy somehow makes sure that after any state $s$ is reached
with positive probability the counter will thereafter be decreased by at least
one in at most $|Q|^2$ steps with positive probability, or else after $s$ is reached the
play will be forced to enter the "trap" state with positive probability. The "trap"
state is an extra absorbing state that keeps increasing the counter value forever
thereafter.

This, firstly, ensures that given an OC-MDP, $\mathcal{A}$, the newly constructed OC-MDP,
$\bar{\mathcal{A}}$, that is derived from it has no idling strategies. Secondly, the construction
ensures the following: for every state $q$ of the original OC-MDP, $\mathcal{A}$, there is a
corresponding state $\bar{q}$ of the newly constructed OC-MDP, $\bar{\mathcal{A}}$, such that the optimal
termination probability starting at configuration $(q, i)$ in $\mathcal{A}$ is equal to the optimal
termination probability starting at configuration $(\bar{q}, i)$ in $\bar{\mathcal{A}}$.

In more detail, the set $\bar{Q}$ of control states of $\bar{\mathcal{A}}$ will consist of one special state
"trap", and of multiple copies of states of $Q$ enhanced with two counters. These
enhanced states are 3- and 5-tuples of the form $\langle q, n, m \rangle$, $[q, n, m, k, r]$, where $q \in Q$,
$(q, k, r) \in \delta$, $0 \leq m \leq |Q|^2 + 1$ is a counter measuring the number of steps until
it exceeds $|Q|^2 + 1$, and $0 \leq n \leq |Q| + 1$ is a counter measuring the difference of the
current counter value minus the initial one, until it drops below $0$ or goes above
$|Q|$.

The triples and 5-tuples alternate in the transitions of $\bar{\mathcal{A}}$. First comes a triple,
$\langle q, n, m \rangle$, indicating the current configuration of the simulation of a play in $\mathcal{A}$.
Then the player has to commit to an outgoing rule, $(q, k, r)$, used on the short
path (which it claims exists) which decreases the counter. This results in entering
$[q, n, m, k, r]$. If $q \in Q_\top$ then the play must move in the next step to $\langle r, n + k, m + 1 \rangle$.
If $q \in Q_P$ then all outgoing rules for $q$ are used in the next step of the simulation,
but the counters $m$ and $n$ are reset to $0$ for all steps following rules other than
$(q, k, r)$. Thus the next possible triples to visit are $\langle r, n + k, m + 1 \rangle$, corresponding
to rule $(q, k, r)$, and states $\langle r', 0, 0 \rangle$ corresponding to rules $(q, k', r')$, where
$(k', r') \neq (k, r)$. The state in $\bar{\mathcal{A}}$ corresponding to a $q$ in $\mathcal{A}$ is $\langle q, 0, 0 \rangle$. Starting at state $\langle q, 0, 0 \rangle$,
the states along a run in $\bar{\mathcal{A}}$ keep track of the number of steps and the change in
counter value, and if the number of steps "overflows" before the counter decreases
to $-1$, this indicates that the path selected by the player is not a short decreasing
path, and the simulation is aborted by moving to the trap state, which results in
an incrementing self-loop. Otherwise, if within a short number $m$ of steps we
reach a state $[q', 0, m, k, q'']$, where $m \leq |Q|^2$, and the next transition decreases the
counter (i.e., $k = -1$), then the two "internal counters" are reset to $0$ and we start
all over again.

We now give a formal definition of $\bar{\mathcal{A}}$, which is an adaptation of a similar
construction given in where it appeared as D'. The set of control states of $\bar{\mathcal{A}}$
is

$$\bar{Q} = \{\mathit{trap}\} \cup \{\langle q, n, m \rangle \mid q \in Q,\ 0 \leq m \leq |Q|^2 + 1,\ 0 \leq n \leq |Q| + 1\}$$
$$\cup\ \{[q, n, m, k, r] \mid (q, k, r) \in \delta,\ 0 \leq m \leq |Q|^2 + 1,\ 0 \leq n \leq |Q| + 1\}.$$

The stochastic states are $\bar{Q}_P := \{\mathit{trap}\} \cup \{[q, n, m, k, r] \in \bar{Q}\}$. The set of rules, $\bar{\delta}$, is the
smallest set containing

$$\{([q, n, |Q|^2 + 1, k, r], 1, \mathit{trap}) \mid (q, k, r) \in \delta,\ 0 \leq n \leq |Q| + 1\}$$
$$\cup\ \{([q, |Q| + 1, m, k, r], 1, \mathit{trap}) \mid (q, k, r) \in \delta,\ 0 \leq m \leq |Q|^2\}$$
$$\cup\ \{(\langle q, n, m \rangle, 0, [q, n, m, k, r]) \mid (q, k, r) \in \delta,\ 0 \leq n \leq |Q| + 1,\ 0 \leq m \leq |Q|^2 + 1\}$$
$$\cup\ \{([q, n, m, k, r], k, \langle r, n + k, m + 1 \rangle) \mid (q, k, r) \in \delta,\ 0 \leq n \leq |Q|,\ n + k \geq 0,\ 0 \leq m \leq |Q|^2\}$$
$$\cup\ \{([q, n, m, k, r], k, \langle r, 0, 0 \rangle) \mid (q, k, r) \in \delta,\ n + k = -1,\ 0 \leq m \leq |Q|^2\}$$
$$\cup\ \{([q, n, m, k, r], k', \langle r', 0, 0 \rangle) \mid (q, k', r') \in \delta,\ q \in Q_P,\ r' \neq r,\ 0 \leq n \leq |Q|,\ 0 \leq m \leq |Q|^2\}$$
$$\cup\ \{(\mathit{trap}, 1, \mathit{trap})\},$$

and also containing the rule $(\bar{q}, 1, \bar{q})$ for each state $\bar{q}$ not having an outgoing rule in
the set above. Finally, $\bar{P}$ is derived from $P$ as follows: for all $\bar{q} \in \bar{Q}_P$ which only
have one outgoing rule the probability of such a rule is $1$. Otherwise we know
$\bar{q} = [q, n, m, k, r]$, $q \in Q_P$, and can set $\bar{P}(([q, n, m, k, r], k', \langle r', n', m' \rangle)) = P((q, k', r'))$
for each $([q, n, m, k, r], k', \langle r', n', m' \rangle) \in \bar{\delta}$.
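To get a feeling for the size of the construction, the state set $\bar{Q}$ can be enumerated directly from its definition. The sketch below is illustrative only: it encodes triples and 5-tuples as Python tuples of length 3 and 5, uses a hypothetical two-state input, and covers the states but not the rules $\bar{\delta}$.

```python
from itertools import product

def enhanced_states(states, rules):
    """Enumerate trap, triples <q, n, m> and 5-tuples [q, n, m, k, r]."""
    size = len(states)
    n_range = range(size + 2)         # 0 <= n <= |Q| + 1
    m_range = range(size * size + 2)  # 0 <= m <= |Q|^2 + 1
    out = {("trap",)}
    out.update((q, n, m) for q, n, m in product(states, n_range, m_range))
    out.update((q, n, m, k, r) for (q, k, r), n, m in product(rules, n_range, m_range))
    return out
```

The number of states is $1 + (|Q| + |\delta|) \cdot (|Q| + 2) \cdot (|Q|^2 + 2)$, which is polynomial in the size of the input and compatible with the $O(\|\mathcal{A}\|^4)$ bound of Lemma 3.10.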

Clearly, $\|\bar{\mathcal{A}}\| \in O(\|\mathcal{A}\|^4)$. For $f$ we choose the function $f(q) = \langle q, 0, 0 \rangle$. The
remaining three properties of $\bar{\mathcal{A}}$ are delivered by Lemma A.3, Lemma A.4 and
Lemma |A31 □ 

Lemma A.3. There are no idling strategies in $\bar{\mathcal{A}}$.

Proof. By contradiction, assume there is a pure counterless idling strategy, $\sigma$.
From the definition of idling, there is a control state $\bar{q} \in \bar{Q}$ which is almost
surely revisited under $\sigma$, and upon every revisit, the counter has the same
value as at the beginning. For every state $\bar{r}$ visited from $\bar{q}$, i.e., such that
$\mathbb{P}^\sigma_{(\bar{q},0)}(\exists i \geq 0 : \mathit{State}^{(i)} = \bar{r}) > 0$, we define the set of possible counter values seen
at a visit from $(\bar{q}, 0)$ to $\bar{r}$ as
$C_{\bar{r}} := \{c \in \mathbb{Z} \mid \exists i \geq 0 : \mathbb{P}^\sigma_{(\bar{q},0)}(\mathit{State}^{(i)} = \bar{r} \wedge C^{(i)} = c) > 0\}$.
First we observe that $|C_{\bar{r}}| = 1$ for all such $\bar{r}$. Indeed, it obviously has
at least one element. On the other hand, if $c, d \in C_{\bar{r}}$, $c \neq d$, then
$\mathbb{P}^\sigma_{(\bar{q},0)}(\exists i > 0 : \mathit{State}^{(i)} = \bar{q} \wedge C^{(i)} = c - d) > 0$, which contradicts our choice of $\bar{q}$
because $c - d \neq 0$. From now on we denote by $c_{\bar{r}}$ the only number such that
$C_{\bar{r}} = \{c_{\bar{r}}\}$.

Now we choose $\bar{r}$ so that $c_{\bar{r}}$ is minimal. Observe that $\mathbb{P}^\sigma_{(\bar{r},0)}(C^{(i)} \geq 0) = 1$ for
all $i \geq 0$, otherwise there is a state, $\bar{r}'$, reachable under $\sigma$ from $\bar{r}$ such that $c_{\bar{r}'} < c_{\bar{r}}$.
But this means that a run from $\bar{r}$ under $\sigma$ visits trap almost surely. It is easy to
see that this implies that a run from $\bar{q}$ visits trap with a positive probability. This
contradicts $\sigma$ being idling and $\bar{q}$ being the witnessing state for idling. □

Before we prove the second important property of $\bar{\mathcal{A}}$ we promised, we note that although technically it is not true that $Q \subseteq \bar{Q}$, we may embed $Q$ into $\bar{Q}$ by mapping $q$ to $(q, 0, 0)$.

Lemma A.4. $\mathrm{Val}(\mathit{Term}, (q, i)) = \mathrm{Val}(\mathit{Term}, ((q, 0, 0), i))$ for all $q \in Q$ and $i \ge 0$.

Proof. The inequality $\mathrm{Val}(\mathit{Term}, (q, i)) \ge \mathrm{Val}(\mathit{Term}, ((q, 0, 0), i))$ is easy, because a strategy in $\mathcal{A}$ can simulate a strategy in $\bar{\mathcal{A}}$ (by "projecting" it onto states of $\mathcal{A}$), except for the case when the run in $\bar{\mathcal{A}}$ reaches $\mathit{trap}$. But after reaching $\mathit{trap}$ no run terminates, so the simulation in $\mathcal{A}$ may continue arbitrarily without producing a lower probability of termination.

To prove $\mathrm{Val}(\mathit{Term}, (q, i)) \le \mathrm{Val}(\mathit{Term}, ((q, 0, 0), i))$, we need to show that there are $\varepsilon$-optimal strategies for $\mathcal{A}$, for arbitrarily small $\varepsilon > 0$, which can be simulated in $\bar{\mathcal{A}}$ while keeping the termination probability $\varepsilon$-close to the original optimal termination value in $\mathcal{A}$. In the simulation we will use a natural correspondence of paths in $\bar{\mathcal{A}}$ to paths in $\mathcal{A}$, given by dropping the odd steps and all additional information. As an example, the path $((q, 0, 0), 0) \to ([q, 0, 0, +1, r], 0) \to ((r, 1, 1), 1)$ corresponds to $(q, 0) \to (r, 1)$.
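The path correspondence can be illustrated with a short Python sketch. It assumes a hypothetical encoding in which a path is a list of `(state, counter)` pairs, control states are 3-tuples `(q, n, m)`, and stochastic states are 5-tuples `(q, n, m, k, r)`; the name `project_path` is illustrative.

```python
def project_path(path_bar):
    """Map a path of the bounded game to the corresponding original path.

    The path alternates between control states (even positions) and
    stochastic states (odd positions), starting with a control state.
    Odd positions are dropped, and from each remaining state only its
    first component (the original control state) is kept.
    """
    return [(state[0], counter) for (state, counter) in path_bar[0::2]]


# The example from the text: ((q,0,0),0) -> ([q,0,0,+1,r],0) -> ((r,1,1),1)
example = [(("q", 0, 0), 0), (("q", 0, 0, 1, "r"), 0), (("r", 1, 1), 1)]
print(project_path(example))  # [('q', 0), ('r', 1)]
```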

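In the proof, we give for every $\varepsilon > 0$ a pure strategy $\sigma_\varepsilon$, and a measurable set of runs, $T_\varepsilon \subseteq \mathit{Term}$ in $\mathcal{A}$, such that for all $q \in Q$ and $i \ge 0$:

• $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(T_\varepsilon) \ge \mathrm{Val}(\mathit{Term}, (q, i)) - \varepsilon$, and

• for all finite paths $u$, $\mathit{len}(u) = n$, such that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(u) \cap T_\varepsilon) > 0$, there is some $k$, $n < k \le n + |Q|^2 + 1$, for which

$$\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}\bigl(C^{(k)} < C^{(n)} \wedge \forall j,\ n \le j < k : 0 \le C^{(j)} - C^{(n)} \le |Q| \bigm| \mathit{Run}(u)\bigr) > 0. \qquad (6)$$

Once we have proved the above, we can simulate the strategy $\sigma_\varepsilon$ in $\bar{\mathcal{A}}$.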
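On a single finite path, the event inside (6) is a simple condition on the sequence of counter values. The Python sketch below checks it for one given counter sequence; it does not compute the probability in (6). The function name `witness_step` and the list encoding of counter values are assumptions made for illustration.

```python
def witness_step(counters, n, num_states):
    """Return the least k with n < k <= n + |Q|^2 + 1 such that
    counters[k] < counters[n] and 0 <= counters[j] - counters[n] <= |Q|
    for all n <= j < k. Return None if the given sequence has no such k.

    counters[t] is the counter value at time t; num_states stands for |Q|.
    """
    limit = min(len(counters) - 1, n + num_states ** 2 + 1)
    for k in range(n + 1, limit + 1):
        if counters[k] < counters[n]:
            return k  # every earlier j already passed the band check below
        if counters[k] - counters[n] > num_states:
            return None  # the counter left the band [C(n), C(n) + |Q|]
    return None
```

For example, with `num_states = 2` the sequence `[2, 3, 3, 2, 1]` has a witness at step 4, whereas `[2, 3, 4, 5, 1]` has none, because the counter rises by 3 before it drops.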



7 Actually this probability is again 1, but we only need to know that it is positive. 
Let us define the simulation in detail. Let $\bar{u}$ be a path from configuration $((q, 0, 0), i)$ in $\bar{\mathcal{A}}$, ending in some configuration $((r, 0, 0), j)$, and $u$ the corresponding path in $\mathcal{A}$, ending in $(r, j)$. Let $n = \mathit{len}(u)$. If $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(u) \cap T_\varepsilon) = 0$, then the rest of the simulating strategy after the initial path $\bar{u}$ can be defined arbitrarily. For later reference we call $\bar{u}$ and all its extensions dead in this case. Otherwise let $w$ be some extension of $u$ witnessing (6), i.e., $w$ of length $k \le n + |Q|^2 + 1$ with a prefix $u$, such that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(w)) > 0$ and

$$\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}\bigl(C^{(k)} < C^{(n)} \wedge \forall j,\ n \le j < k : 0 \le C^{(j)} - C^{(n)} \le |Q| \bigm| \mathit{Run}(w)\bigr) = 1.$$

We fix a unique choice of such a $w$ (which depends on $u$), and we define the simulating strategy in $\bar{\mathcal{A}}$ for all histories $\bar{v}$ such that the path $v$ in $\mathcal{A}$ which corresponds to $\bar{v}$ is an extension of $u$ and a proper prefix of $w$. The definition is by induction on the length of $v$. Such a history $\bar{v}$ ends in some $(s, h, m)$, where $0 \le m < |Q|^2 + 1$ and $h \ge 0$. Let $(s, c)$ be the last configuration on $v$ and $(s, c) \to (s', c')$ the next step in $w$ after completing $v$. Then the rule chosen with probability 1 by the simulating strategy in $\bar{\mathcal{A}}$ for the history $\bar{v}$ is $((s, h, m), 0, [s, h, m, c' - c, s'])$. Observe that due to our choice of $w$, the control state visited in the simulation after completing $\bar{v}$ and visiting $[s, h, m, c' - c, s']$ is either (a) $(s', h + c' - c, m + 1)$, where $m + 1 \le |Q|^2 + 1$ and $h + c' - c \ge 0$, or else (b) a state $(s'', 0, 0)$, for some state $s'' \in Q$. In the former case (a) we continue with a new $\bar{v}$ as above. In the latter case (b), we are again back in a state of the form $(r, 0, 0)$, and thus we need to find a new extension $w'$ (unless we are now in a dead history) and start the process all over again. Because every history in the simulation is either dead, or ends in some $((r, 0, 0), j)$, or is some short extension $\bar{v}$ of such a history which is not dead and ends in some state $(r, 0, 0)$, as above, we have now defined the simulating strategy for every history in $\bar{\mathcal{A}}$.

Moreover, consider a path $\bar{u}$ in $\bar{\mathcal{A}}$ which is not dead. Because we could not possibly hit $\mathit{trap}$ in $\bar{\mathcal{A}}$ before reaching a dead history, and because $\sigma_\varepsilon$ is pure, the probability of $\mathit{Run}(\bar{u})$ in the simulation is $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(u))$, where $u$ is the path in $\mathcal{A}$ corresponding to $\bar{u}$. As a consequence, once we prove the existence of $\sigma_\varepsilon$ and the validity of its properties, we have proven that the termination value in the simulation is at least $\mathrm{Val}(\mathit{Term}, (q, i)) - \varepsilon$. Because $\varepsilon > 0$ can be chosen arbitrarily, this proves $\mathrm{Val}(\mathit{Term}, (q, i)) \le \mathrm{Val}(\mathit{Term}, ((q, 0, 0), i))$. In the rest of the proof we show how to construct such a strategy $\sigma_\varepsilon$ in $\mathcal{A}$ for every $\varepsilon > 0$.

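Let $t \ge 0$. By $\mathit{Term}^{\le t}$ we denote the event that $C^{(t')} = 0$ for some $t' \le t$. By standard facts (see, e.g., [16, Theorem 4.3.3]), for all $t$ there is a pure strategy $\tau_t$ optimal for $\mathit{Term}^{\le t}$, i.e., such that for all $q \in Q$, $i \ge 0$: $\mathbb{P}^{\tau_t}_{(q,i)}(\mathit{Term}^{\le t}) = \mathrm{Val}(\mathit{Term}^{\le t}, (q, i))$. Also, easily $\lim_{t \to \infty} \mathrm{Val}(\mathit{Term}^{\le t}, (q, i)) = \mathrm{Val}(\mathit{Term}, (q, i))$, thus for all $\varepsilon > 0$ there is $t_\varepsilon$ such that $\mathrm{Val}(\mathit{Term}^{\le t_\varepsilon}, (q, i)) \ge \mathrm{Val}(\mathit{Term}, (q, i)) - \varepsilon$. We set $T_\varepsilon := \mathit{Term}^{\le t_\varepsilon}$.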

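Let us fix an $\varepsilon > 0$, and consider the corresponding $t_\varepsilon$. We now define $\sigma_\varepsilon$. Let $u$ be a path in $\mathcal{A}$ of length $n \le t_\varepsilon$, ending in a configuration $(r, j)$. Pick the least $t \le t_\varepsilon - n$ such that $\mathrm{Val}(\mathit{Term}^{\le t}, (r, j)) = \mathrm{Val}(\mathit{Term}^{\le t_\varepsilon - n}, (r, j))$. Then $\sigma_\varepsilon(u) = \tau_t((r, j))$. For $u$ where $\mathit{len}(u) > t_\varepsilon$ we define $\sigma_\varepsilon(u)$ arbitrarily. Due to the Bellman-equation characterization of optimality for finite-horizon objectives, given, e.g., in [16, Section 4.3], we obtain that for all configurations $(q, i)$: $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Term}) \ge \mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Term}^{\le t_\varepsilon}) = \mathrm{Val}(\mathit{Term}^{\le t_\varepsilon}, (q, i)) \ge \mathrm{Val}(\mathit{Term}, (q, i)) - \varepsilon$, as required.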
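The finite-horizon values used in this definition can be computed by backward induction over the horizon. The following Python sketch shows one way to do this for a maximizing OC-MDP, together with the choice of the least horizon $t$. The dictionary encoding and the names `make_value`, `least_horizon`, `rules`, `prob` and `player_states` are assumptions introduced for illustration; exact rational arithmetic is used so that values can be compared for equality.

```python
from fractions import Fraction
from functools import lru_cache

def make_value(rules, prob, player_states):
    """Return val(t, q, i) = Val(Term^{<=t}, (q, i)) for a maximizing OC-MDP.

    rules:         dict mapping a state q to a list of pairs (k, r), k in {-1, 0, +1}
    prob:          dict mapping (q, k, r) to a Fraction, for stochastic states q
    player_states: set of states where the maximizer picks the rule
    """
    @lru_cache(maxsize=None)
    def val(t, q, i):
        if i == 0:
            return Fraction(1)   # counter is already 0: terminated
        if t == 0:
            return Fraction(0)   # no steps left
        succ = [(k, r, val(t - 1, r, i + k)) for (k, r) in rules[q]]
        if q in player_states:
            return max(v for (_, _, v) in succ)                 # Bellman maximum
        return sum(prob[(q, k, r)] * v for (k, r, v) in succ)   # expectation
    return val

def least_horizon(val, t_max, q, i):
    """Least t <= t_max whose value at (q, i) equals the value for horizon t_max."""
    target = val(t_max, q, i)
    return next(t for t in range(t_max + 1) if val(t, q, i) == target)
```

As a usage example, for a single stochastic state that moves the counter up or down with probability 1/2 each, `val(3, "s", 1)` evaluates to 5/8: the run terminates either in the first step, or after going up once and then down twice.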

For $k \ge 0$, let $E_k$ be the event that there are times $t, t'$, $k \le t < t'$, such that $\mathit{State}^{(t)} = \mathit{State}^{(t')}$ and $0 < C^{(t)} \le C^{(t')}$.
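For a finite path, membership in $E_k$ can be tested directly from this definition. The Python sketch below does so by brute force; the name `in_E` and the encoding of a path as a list of `(state, counter)` pairs indexed by time are illustrative assumptions, and the non-strict comparison between the two counter values reflects the reading of the definition given here.

```python
def in_E(path, k):
    """True if there are times k <= t < t' on the path with the same control
    state and counter values satisfying 0 < C(t) <= C(t')."""
    for t in range(k, len(path)):
        state, counter = path[t]
        if counter <= 0:
            continue
        # look for a later visit to the same state with a counter at least as large
        if any(s2 == state and counter <= c2 for (s2, c2) in path[t + 1:]):
            return True
    return False
```

Intuitively, a path in $E_k$ repeats a control state after time $k$ without having made progress towards counter value 0 in between.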

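Claim 4. Let $k \ge 0$, and $(q, i)$ be a configuration. Further, let $u$ be an arbitrary path such that $\mathit{len}(u) = k$, $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(u)) > 0$ and $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Term}^{\le t_\varepsilon} \mid \mathit{Run}(u)) > 0$. Then $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(E_k \mid \mathit{Run}(u) \cap \mathit{Term}^{\le t_\varepsilon}) < 1$.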

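Proof. By contradiction. For $k \ge t_\varepsilon$ the statement is obvious. Fix some $k$, $0 \le k < t_\varepsilon$, and $(q, i)$. Assume that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(E_k \mid \mathit{Run}(u) \cap \mathit{Term}^{\le t_\varepsilon}) = 1$. Let $w$ be an arbitrary extension of $u$ such that $\mathit{len}(w) = t_\varepsilon$, $\mathit{Run}(w) \subseteq \mathit{Term}^{\le t_\varepsilon}$, and $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(w)) > 0$. Then clearly $\mathit{Run}(w) \subseteq E_k$. This means that there are times $t, t'$, $k \le t < t' \le t_\varepsilon$, such that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{State}^{(t)} = \mathit{State}^{(t')} \mid \mathit{Run}(w)) = 1$, and $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(0 < C^{(t)} \le C^{(t')} \mid \mathit{Run}(w)) = 1$. Consider the prefixes $\hat{w}$, $w'$ of $w$ of lengths $t$ and $t'$, respectively. There is some state $r \in Q$, and counter values $0 < j \le j'$ such that $\hat{w}$ ends in $(r, j)$, and $w'$ ends in $(r, j')$. By the construction of $\sigma_\varepsilon$, there are $h \le t_\varepsilon - t$ and $h' \le t_\varepsilon - t'$ such that $h' < h$, $\mathrm{Val}(\mathit{Term}^{\le h}, (r, j)) > \mathrm{Val}(\mathit{Term}^{\le h-1}, (r, j)) \ge \mathrm{Val}(\mathit{Term}^{\le h'}, (r, j'))$, and $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Term}^{\le t_\varepsilon} \mid \mathit{Run}(\hat{w})) = \mathrm{Val}(\mathit{Term}^{\le h}, (r, j))$, $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Term}^{\le t_\varepsilon} \mid \mathit{Run}(w')) = \mathrm{Val}(\mathit{Term}^{\le h'}, (r, j'))$. In other words, on every extension of $u$ which eventually satisfies $\mathit{Term}^{\le t_\varepsilon}$ there is a moment where the probability of $\mathit{Term}^{\le t_\varepsilon}$, conditionally on the current history, sharply decreases. This is in contradiction with the fact that $\sigma_\varepsilon$ is optimal with respect to $\mathit{Term}^{\le t_\varepsilon}$, and thus satisfies the Bellman optimality equations (cf. [16, Section 4.3]). □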

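Let us fix an arbitrary path $u$, $\mathit{len}(u) = n$, such that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(u) \cap \mathit{Term}^{\le t_\varepsilon}) > 0$. We apply Claim 4 and obtain a witnessing extension, $w$, $\mathit{len}(w) = m$, of $u$ so that $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(\mathit{Run}(w)) > 0$, $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(C^{(m)} = 0 \mid \mathit{Run}(w)) = 1$, and $\mathbb{P}^{\sigma_\varepsilon}_{(q,i)}(E_n \mid \mathit{Run}(w)) = 0$. By Lemma A.2 this implies that there is some $k$, $n < k \le n + |Q|^2 + 1$, such that (6) is satisfied. Thus we proved all the required properties of $\sigma_\varepsilon$ and $T_\varepsilon = \mathit{Term}^{\le t_\varepsilon}$, and the proof is done. □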






As a consequence, we obtain the last promised property of $\bar{\mathcal{A}}$.

Lemma A.5. If $\mathrm{Val}(\mathit{LimInf}(= -\infty), q) < 1$ for all $q \in Q$ in $\mathcal{A}$, then $\bar{\mathcal{A}}$ is rising.

Proof. By Lemma A.3 there are no idling strategies in $\bar{\mathcal{A}}$. It remains to prove that $\mathrm{Val}(\mathit{LimInf}(= -\infty), \bar{q}) < 1$ for all $\bar{q} \in \bar{Q}$ in $\bar{\mathcal{A}}$. First we prove it for $\bar{q}$ of the form $(q, 0, 0)$. If $\mathrm{Val}(\mathit{LimInf}(= -\infty), q) < 1$ then there is $i \ge 0$ such that $\mathrm{Val}(\mathit{Term}, (q, i)) < 1$, by, e.g., Lemma 14 of [0]. By Lemma A.4 we thus have $\mathrm{Val}(\mathit{Term}, (\bar{q}, i)) < 1$, and thus, again by Lemma 14 of [0], $\mathrm{Val}(\mathit{LimInf}(= -\infty), \bar{q}) < 1$. If $\bar{q} = \mathit{trap}$ we are done immediately, as $\mathrm{Val}(\mathit{LimInf}(= -\infty), \mathit{trap}) = 0$. Finally, in all remaining cases of $\bar{q}$ the play will almost surely reach some state from $\{\mathit{trap}\} \cup \{(q, 0, 0) \mid q \in Q\}$. Because $\mathit{LimInf}(= -\infty)$ is prefix independent, $\mathrm{Val}(\mathit{LimInf}(= -\infty), \bar{q}) < 1$ also in this case, and the proof is finished. □

A.3. Bounds on N

Here we derive an exponential upper bound on the value $N$, introduced in Section 3. Recall that, given an OC-SSG, $\mathcal{A} = (Q, \delta, P)$, and an $\varepsilon > 0$, we want $N$ to satisfy:

$$\mathrm{Val}(\mathit{Term}, (q, i)) - \mathrm{Val}(\mathit{LimInf}(= -\infty), q) \le \varepsilon \quad \text{for all } q \in Q \text{ and } i \ge N.$$

By the results of Section 3.2 it suffices to consider only the case when $\mathcal{A}$ is a maximizing OC-MDP. From Section 3.1 we know that $N := \max\{h, \lceil \log_c(\varepsilon \cdot (1 - c)) \rceil\}$, where $c < 1$ and $h$ are the constants defined there in terms of $\bar{x}$ and $z_{\max}$, which are solutions to a linear program with coefficients polynomial in $\|\mathcal{A}\|$. Thus there is a positive polynomial, $p$, such that $c \le e^{-e^{-p(\|\mathcal{A}\|)}}$ and $h \le e^{p(\|\mathcal{A}\|)}$. If $N \le h$, then $N$ is clearly exponentially bounded in $\|\mathcal{A}\|$. Otherwise

$$N \le f_\varepsilon(c) := \frac{\ln(\varepsilon) + \ln(1 - c)}{\ln(c)}.$$

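Observe that $f_\varepsilon(c)$ grows as $c \to 1^-$ for fixed $\varepsilon$, because $c^{f_\varepsilon(c)} / (1 - c) = \varepsilon$ and $c^{i} / (1 - c)$ grows as $c \to 1^-$ for fixed $i$. Thus

$$
\begin{aligned}
N \le f_\varepsilon(c) &\le f_\varepsilon\bigl(e^{-e^{-p(\|\mathcal{A}\|)}}\bigr)
= \frac{\ln(\varepsilon) + \ln\bigl(1 - e^{-e^{-p(\|\mathcal{A}\|)}}\bigr)}{-e^{-p(\|\mathcal{A}\|)}} \\
&= e^{p(\|\mathcal{A}\|)} \cdot \ln(1/\varepsilon) - \ln\bigl(1 - e^{-e^{-p(\|\mathcal{A}\|)}}\bigr) \cdot e^{p(\|\mathcal{A}\|)}. \qquad (7)
\end{aligned}
$$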



Before we prove that this is indeed an exponential bound on N, let us prove 
two auxiliary claims. 
Claim 5. For all $n \ge 0$ the following inequality holds:

$$e^{-1} - e^{-1 - e^{-n}} \le 1 - e^{-e^{-(n+1)}}. \qquad (8)$$

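Proof. We set $d(n) := e^{-e^{-(n+1)}} - e^{-1 - e^{-n}}$. The inequality (8) is equivalent to $d(n) \le 1 - e^{-1}$. Because $\lim_{n \to \infty} d(n) = 1 - e^{-1}$, it suffices to prove that $d(n)$ is increasing. Observe that

$$d(n+1) - d(n) = \bigl(e^{-e^{-(n+2)}} - e^{-e^{-(n+1)}}\bigr) - e^{-1} \cdot \bigl(e^{-e^{-(n+1)}} - e^{-e^{-n}}\bigr). \qquad (9)$$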

Also, because the exponential function $e^x$ is increasing and has an increasing derivative $e^x > 0$, we know that

$$\frac{e^{a} - e^{b}}{e^{b} - e^{c}} > \frac{a - b}{b - c} \quad \text{for all } a > b > c.$$

In particular, setting $a = -e^{-(n+2)}$, $b = -e^{-(n+1)}$, and $c = -e^{-n}$ yields

$$\frac{e^{-e^{-(n+2)}} - e^{-e^{-(n+1)}}}{e^{-e^{-(n+1)}} - e^{-e^{-n}}} > \frac{-e^{-(n+2)} + e^{-(n+1)}}{-e^{-(n+1)} + e^{-n}} = e^{-1}.$$

By (9), this implies $d(n+1) > d(n)$ as required. □
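The two facts used in this proof, that $d(n)$ is increasing and stays below its limit $1 - e^{-1}$, are easy to sanity-check numerically. The following Python sketch does this for small $n$ only; it is a floating-point check under the definition of $d(n)$ given in the proof, not a replacement for the argument, and the names `d`, `LIMIT` and `values` are illustrative.

```python
import math

def d(n):
    """d(n) = exp(-exp(-(n+1))) - exp(-1 - exp(-n))."""
    return math.exp(-math.exp(-(n + 1))) - math.exp(-1.0 - math.exp(-n))

LIMIT = 1.0 - math.exp(-1.0)          # the limit of d(n) as n grows

# Restrict to small n: the gap LIMIT - d(n) shrinks roughly like exp(-2n),
# so for large n it falls below double-precision resolution.
values = [d(n) for n in range(10)]
increasing = all(x < y for x, y in zip(values, values[1:]))
bounded = all(v < LIMIT for v in values)
print(increasing, bounded)
```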

Claim 6. For all $n \ge 0$ the following inequality holds:

$$n + 1 > -\ln\bigl(1 - e^{-e^{-n}}\bigr). \qquad (10)$$

Proof. By induction. A direct computation for $n = 0$ shows $-\ln(1 - e^{-e^{0}}) = -\ln(1 - e^{-1}) < 0.46 < 1$. Consider now $n = k + 1$ for some $k \ge 0$. Using (8) and the inductive hypothesis, we obtain

$$(k + 1) + 1 > -\ln\bigl(1 - e^{-e^{-k}}\bigr) + 1 = -\ln\bigl(e^{-1} - e^{-1 - e^{-k}}\bigr) \ge -\ln\bigl(1 - e^{-e^{-(k+1)}}\bigr).$$

□
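Inequality (10) can also be checked numerically. In the Python sketch below, the right-hand side is evaluated with `math.expm1` so that $1 - e^{-x}$ does not suffer from cancellation when $x = e^{-n}$ is tiny; the function name `rhs` is an illustrative choice.

```python
import math

def rhs(n):
    """-ln(1 - exp(-exp(-n))), using expm1 for accuracy when exp(-n) is small."""
    return -math.log(-math.expm1(-math.exp(-n)))

# n + 1 should exceed rhs(n) for every tested n; the gap approaches 1 as n grows.
holds = all(n + 1 > rhs(n) for n in range(200))
print(holds, rhs(0))
```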

Finally, using (10) in (7) we get $N \le e^{p(\|\mathcal{A}\|)} \cdot \ln(1/\varepsilon) + (1 + p(\|\mathcal{A}\|)) \cdot e^{p(\|\mathcal{A}\|)}$.
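To see how the intermediate expression $f_\varepsilon(c)$ relates to the final closed form, the following Python sketch evaluates both for a few sample values. Here `p_val` is a stand-in for the number $p(\|\mathcal{A}\|)$, and the function names are illustrative assumptions; the polynomial $p$ itself is not computed.

```python
import math

def f_eps(eps, c):
    """f_eps(c) = (ln(eps) + ln(1 - c)) / ln(c), i.e. log_c(eps * (1 - c))."""
    return (math.log(eps) + math.log1p(-c)) / math.log(c)

def closed_form_bound(eps, p_val):
    """e^p * ln(1/eps) + (1 + p) * e^p, with p_val standing for p(||A||)."""
    return math.exp(p_val) * math.log(1.0 / eps) + (1.0 + p_val) * math.exp(p_val)

eps = 0.01
for p_val in (1, 2, 5):
    c = math.exp(-math.exp(-p_val))   # the largest c allowed by the bound on c
    print(p_val, f_eps(eps, c), closed_form_bound(eps, p_val))
```

In each printed row the second number should not exceed the third, and increasing `c` towards 1 for fixed `eps` should increase `f_eps`, matching the monotonicity observation used to derive (7).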